[GitHub] HeartSaVioR commented on issue #23840: [SPARK-24295][SS] Add option to retain only last batch in file stream sink metadata

GitBox Tue, 19 Feb 2019 19:19:47 -0800

HeartSaVioR commented on issue #23840: [SPARK-24295][SS] Add option to retain 
only last batch in file stream sink metadata
URL: https://github.com/apache/spark/pull/23840#issuecomment-465405043
 
 
   In practice, end users would have a data retention policy, and output 
files could be removed based on that policy. So it would be ideal if the 
metadata reflected changes to the output files, but from Spark's point of view 
that doesn't look easy to do. For example, if we periodically check the 
existence of the files in the metadata list (maybe every X batches, to avoid 
concurrent modification), that adds another large overhead that slows the 
query down. Specifying the retention policy in the Spark query (for files that 
will be removed outside of Spark) is also really odd, so neither option is 
elegant.
   
   If it's OK for the file stream sink to periodically check the existence of 
files and drop removed files from the file log, I'll apply the change.
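
   To make the periodic-check idea concrete, here is a minimal sketch in plain 
Python. It is not Spark code and not what the PR implements: `prune_missing`, 
`maybe_prune`, and `check_interval` (the "X batches" above) are hypothetical 
names, and the file log is modelled as a simple list of paths.

```python
import os
from typing import Callable, Iterable, List


def prune_missing(entries: Iterable[str],
                  exists: Callable[[str], bool] = os.path.exists) -> List[str]:
    """Keep only the log entries whose output file still exists."""
    return [path for path in entries if exists(path)]


def maybe_prune(batch_id: int,
                entries: List[str],
                check_interval: int,
                exists: Callable[[str], bool] = os.path.exists) -> List[str]:
    """Run the existence check only on every `check_interval`-th batch.

    Every entry costs one existence lookup, so the check is skipped on
    all other batches to bound the overhead.
    """
    if check_interval > 0 and batch_id % check_interval == 0:
        return prune_missing(entries, exists)
    return list(entries)
```

   Even in this toy form, each check costs one lookup per entry in the log, 
which is the overhead concern raised above.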


