[ https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672805#comment-16672805 ]
Jungtaek Lim commented on SPARK-20568:
--------------------------------------

[~zsxwing] I've thought about it a bit. I'm not familiar with the file stream source, but if I'm not missing something here, there's no "in progress" state for a file: a file is processed entirely within the batch that includes it. So we have two options here:

1. Delete (or move out) files which are listed in finished batch files under the "sources" directory in the checkpoint.
2. Delete (or move out) files which are included in the "current" batch, as soon as that batch completes.

If we move files out to some directory like "archive", I guess option 2 is safe: the moved files can be moved back again to re-run a previous batch if end users really want to. Actually, I haven't heard of real cases where someone removes batches from the checkpoint directory to re-run a previous batch.

What do you think about the options?

> Delete files after processing in structured streaming
> -----------------------------------------------------
>
>                 Key: SPARK-20568
>                 URL: https://issues.apache.org/jira/browse/SPARK-20568
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.1.0, 2.2.1
>            Reporter: Saul Shanabrook
>            Priority: Major
>
> It would be great to be able to delete files after processing them with
> structured streaming.
>
> For example, I am reading in a bunch of JSON files and converting them into
> Parquet. If the JSON files are not deleted after they are processed, it
> quickly fills up my hard drive. I originally [posted this on Stack
> Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to
> make a feature request for it.


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
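For illustration, the archival semantics of option 2 above could be sketched in plain Python. This is a hypothetical helper, not Spark's actual implementation; a real version would run inside the file stream source right after a batch is committed, and the function name and signature are assumptions for the sketch:

```python
import shutil
from pathlib import Path

def archive_batch_files(batch_files, archive_dir):
    """Move the source files consumed by a just-completed batch into an
    archive directory (option 2 from the comment above).

    Hypothetical helper: a real implementation would be invoked by the
    file stream source after the batch commit, which is not shown here.
    """
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in batch_files:
        src = Path(f)
        dest = archive / src.name
        # Move rather than delete, so an operator can move the files
        # back to re-run a previous batch if they really want to.
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Because the files are moved instead of deleted, recovering a previous batch is a manual but possible operation: move the archived files back into the source directory, exactly as the comment suggests.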