HeartSaVioR edited a comment on issue #22952: [SPARK-20568][SS] Provide option to clean up completed files in streaming query
URL: https://github.com/apache/spark/pull/22952#issuecomment-467678800

> why do Spark users want to use these options? Is it just a matter of controlling storage space without an offline janitor?

I believe earlier comments on the original issue already covered this, so please take a look there first: [SPARK-20568](https://issues.apache.org/jira/browse/SPARK-20568)

For me it's about data retention policy. We are very limited if we have to keep all the source files as they are (in data size as well as metadata length; imagine how compaction works), so we should provide a way to purge some of them, but in a safe way: only source files which will never be accessed again. I don't think this can be done outside of the query; even if it can, it requires a really hacky approach of reading the checkpoint/metadata, deleting the source files, and instrumenting the checkpoint.

Spark seems to have overlooked high-volume (or very many files) / long-running streaming queries. Another example is the metadata growing on both the file stream source and the file stream sink. Spark will compact and purge metadata files, but the overall list of files cannot be reduced if we don't apply retention. A relevant issue is filed as [SPARK-24295](https://issues.apache.org/jira/browse/SPARK-24295), and the reporter has already taken a hacky way to get around it.

> There have also been a lot of comments discussed without unit tests written to confirm we've resolved the issue.

Could you please point out which ones those are? It would be helpful to just comment on which part(s) the UTs don't cover.
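
To illustrate the point about the file list never shrinking, here is a toy model in Python. It is only a sketch under assumptions: it does not reflect Spark's actual metadata log format or any Spark API, and `FileEntry`, `compact`, and `compact_with_retention` are hypothetical names made up for this example. It shows that compaction alone merges batches but keeps every entry, while a retention cutoff is what reduces the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical metadata log entry: one file seen in one batch."""
    path: str
    batch_id: int


def compact(batches):
    """Merge per-batch entry lists into a single list; nothing is dropped."""
    return [entry for batch in batches for entry in batch]


def compact_with_retention(batches, min_batch_id):
    """Compact, then drop entries recorded before min_batch_id (assumed policy)."""
    return [e for e in compact(batches) if e.batch_id >= min_batch_id]


# 100 batches with 10 files each, standing in for a long-running query.
batches = [
    [FileEntry(f"/data/in/b{b}-f{i}.json", b) for i in range(10)]
    for b in range(100)
]

plain = compact(batches)
retained = compact_with_retention(batches, min_batch_id=90)

print(len(plain))     # 1000: compaction alone keeps every entry
print(len(retained))  # 100: only the last 10 batches survive retention
```

In this model the compacted list grows linearly with the number of files ever processed, which is the growth described above; only the retention cutoff bounds it.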
