HeartSaVioR commented on issue #23840: [SPARK-24295][SS] Add option to retain only last batch in file stream sink metadata
URL: https://github.com/apache/spark/pull/23840#issuecomment-465405043

In practice, end users have a data retention policy, and output files may be removed according to that policy. Ideally the metadata would reflect such changes to the output files, but from Spark's point of view that does not look easy to do. For example, if we periodically check the existence of the files in the metadata list (maybe every X batches, to avoid concurrent modification), that adds another huge overhead that slows the query down. Specifying the retention policy in the Spark query (that is, declaring which files will be removed outside of Spark) is also really odd, so neither option is elegant.

If it's OK for the file stream sink to periodically check the existence of files and drop removed files from the file log, I'll apply the change.
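
To make the trade-off concrete, here is a minimal sketch of the periodic existence check described above. It is plain Python, not Spark code: `FileLog`, `check_interval`, `add_batch`, and the `exists` callback are all hypothetical names invented for illustration, and the real file stream sink log is structured differently.

```python
import os
from typing import Callable, Iterable, List


class FileLog:
    """Hypothetical stand-in for a sink metadata log (not Spark's API)."""

    def __init__(self, check_interval: int,
                 exists: Callable[[str], bool] = os.path.exists) -> None:
        if check_interval < 1:
            raise ValueError("check_interval must be >= 1")
        self.check_interval = check_interval  # the "X batches" from the comment
        self.exists = exists                  # injectable so it can be faked
        self.entries: List[str] = []          # output file paths recorded so far
        self.batches_seen = 0
        self.existence_checks = 0             # counts filesystem lookups made

    def add_batch(self, paths: Iterable[str]) -> None:
        """Record one batch of output files, pruning every check_interval batches."""
        self.entries.extend(paths)
        self.batches_seen += 1
        if self.batches_seen % self.check_interval == 0:
            self._prune_missing()

    def _prune_missing(self) -> None:
        """Drop entries whose files were removed outside of the query."""
        kept = []
        for path in self.entries:
            self.existence_checks += 1  # one lookup per logged file
            if self.exists(path):
                kept.append(path)
        self.entries = kept
```

Each pruning pass issues one existence lookup per logged file, so the cost grows with the size of the log. That is the overhead concern raised above; a larger `check_interval` only amortizes it.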
