HeartSaVioR commented on issue #22952: [SPARK-20568][SS] Provide option to 
clean up completed files in streaming query
URL: https://github.com/apache/spark/pull/22952#issuecomment-467678800
 
 
   > why do Spark users want to use these options? Is it just a matter of 
controlling storage space without an offline janitor?
   
   I believe this was already discussed in the original issue, so please take a 
look there first: 
[SPARK-20568](https://issues.apache.org/jira/browse/SPARK-20568)
   
   For me it's about data retention policy. We are very limited if we have to 
keep all the source files as they are (data size, as well as metadata length; 
imagine how compaction works), so we should provide a way to purge some of 
them, but in a safe way: only source files which will never be accessed again. 
I don't think this can be done outside of the query; even if it can, it 
requires a really hacky approach of reading the checkpoint/metadata and 
deleting the files based on it.
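
   To make the "safe way" concrete, here is a minimal Python sketch, not Spark 
code. It assumes the query itself already knows which source files belong to 
committed batches; the function name `purge_committed` and the `committed` set 
are hypothetical stand-ins for whatever the source's metadata log records.

```python
from pathlib import Path


def purge_committed(source_dir: Path, committed: set) -> list:
    """Delete only files the query has marked as committed (hypothetical helper).

    `committed` holds file names that belong to finished batches, so they will
    never be read again. Anything else in the directory is left untouched.
    """
    removed = []
    for path in sorted(source_dir.iterdir()):
        # Skip subdirectories and any file the query has not committed yet.
        if path.is_file() and path.name in committed:
            path.unlink()
            removed.append(path.name)
    return removed
```

   An external janitor would have to reconstruct `committed` by parsing the 
checkpoint itself, which is the hacky part; inside the query that information 
is already at hand.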
   
   Spark seems to have overlooked high-volume (or many-file), long-running 
streaming queries. Another example is the metadata growing in both the file 
stream source and the file stream sink. Spark will compact and purge metadata 
files, but the overall list of files cannot shrink unless we apply retention. 
A relevant issue is filed as 
[SPARK-24295](https://issues.apache.org/jira/browse/SPARK-24295), and the 
reporter has already resorted to a hacky workaround.
   
   > There have also been a lot of comments discussed without unit tests 
written to confirm we've resolved the issue.
   
   Could you please point out which ones? It would be helpful to just comment 
on which part(s) the UTs don't cover.
