HeartSaVioR opened a new pull request #28363:
URL: https://github.com/apache/spark/pull/28363


   ### What changes were proposed in this pull request?
   
   This patch proposes a new option to specify a time-to-live (TTL) for output 
file entries in FileStreamSink. An entry's age is computed as the current 
timestamp minus its commit time (the time `ManifestFileCommitProtocol.commitJob` 
is called to write the streaming file sink metadata log); an entry whose age 
exceeds the TTL is considered outdated.
   
   This patch filters outdated output files out of the metadata while writing 
compact batches (other batches have no functionality to clean entries). This 
keeps the metadata from growing linearly, and the filtered-out files will 
"eventually" no longer be seen by reader queries which leverage 
File(Stream)Source.
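
   The filtering step described above can be illustrated with a minimal, 
standalone sketch. This is not the code in the patch: `SinkFileEntry`, 
`is_outdated`, and `compact` are hypothetical names, and treating an entry as 
outdated only when its age is strictly greater than the TTL is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SinkFileEntry:
    """Hypothetical stand-in for one output file entry in the sink metadata log."""
    path: str
    commit_time_ms: int  # time the entry was committed to the metadata log


def is_outdated(entry: SinkFileEntry, now_ms: int, ttl_ms: Optional[int]) -> bool:
    """An entry is outdated when (current timestamp - commit time) exceeds the TTL.

    ttl_ms=None models the default, where no TTL is applied.
    The strict '>' comparison is an assumption of this sketch.
    """
    if ttl_ms is None:
        return False
    return now_ms - entry.commit_time_ms > ttl_ms


def compact(previous: List[SinkFileEntry],
            new: List[SinkFileEntry],
            now_ms: int,
            ttl_ms: Optional[int] = None) -> List[SinkFileEntry]:
    """Merge earlier entries with the new batch, dropping outdated entries.

    Filtering happens only here, mirroring the point that only compact
    batches get a chance to clean entries.
    """
    merged = list(previous) + list(new)
    return [e for e in merged if not is_outdated(e, now_ms, ttl_ms)]
```

   With `now_ms=10000` and `ttl_ms=6000`, an entry committed at `1000` is 
dropped during compaction while entries committed at `5000` and `9000` are 
kept; with `ttl_ms=None` every entry is retained, matching the default of not 
applying a TTL.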
   
   ### Why are the changes needed?
   
   The metadata log greatly helps to achieve exactly-once easily, but given 
the output path is open to arbitrary readers, there's no way to prune entries 
from the metadata log. As a result, the metadata file keeps growing as the 
query runs for a long time, especially for compact batches.
   
   Lots of end users have been reporting the issue: see comments in 
[SPARK-24295](https://issues.apache.org/jira/browse/SPARK-24295), 
[SPARK-29995](https://issues.apache.org/jira/browse/SPARK-29995), and 
[SPARK-30462](https://issues.apache.org/jira/browse/SPARK-30462).
   (Some reports from end users include their workarounds: see SPARK-24295.)
   
   ### Does this PR introduce any user-facing change?
   
   No, as the configuration is new and by default it is not applied.
   
   ### How was this patch tested?
   
   Covered by new unit test(s).

