itsvikramagr edited a comment on pull request #28904:
URL: https://github.com/apache/spark/pull/28904#issuecomment-666248758


   @HeartSaVioR - This is a much-needed fix. Thanks for it.
   
   I have an orthogonal question: why do we need to worry about file sink metadata files at all? I can think of the following reasons (see the sketch after this list):
   - Downstream read operations can read the compacted metadata file to list all committed files, avoiding the listing cost and improving read performance.
   - It helps with exactly-once semantics: on task failure, we don't have to worry about deleting any files that were already written.
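
   A minimal sketch of the first point, assuming a Parquet file sink (the S3 paths are illustrative, and `FileStreamSinkLog` is an internal API whose constructor may differ across Spark versions):

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.execution.streaming.FileStreamSinkLog

   val spark = SparkSession.builder().appName("metadata-read-sketch").getOrCreate()

   // A batch read over the sink's output consults _spark_metadata instead of
   // listing the directory, so only files from committed batches are visible.
   val df = spark.read.parquet("s3://bucket/sink-path")

   // The committed-file list can also be inspected directly via the sink log.
   val sinkLog = new FileStreamSinkLog(
     FileStreamSinkLog.VERSION, spark, "s3://bucket/sink-path/_spark_metadata")
   val committedFiles = sinkLog.allFiles().map(_.path)
   ```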
   
   If the compacted metadata file size runs into gigabytes, the number of valid files would be in the millions. In practice, the end user will treat this sink path as a staging location and run another job to compact these small files into a final destination.
   
   For exactly-once semantics, we could change ManifestFileCommitProtocol to delete files in the abort function, or come up with some other alternative.
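
   As a very rough sketch of that abort-time cleanup idea (the class and method bodies here are invented for illustration; Spark's actual protocol records new files via `newTaskTempFile` and would need the equivalent hook in `abortTask`):

   ```scala
   import scala.collection.mutable.ArrayBuffer

   import org.apache.hadoop.fs.Path
   import org.apache.hadoop.mapreduce.TaskAttemptContext

   // Illustration only: track every file a task writes, and delete them all
   // if the task aborts, so a failed task leaves no orphaned output behind.
   class CleanupOnAbortTracker {
     private val addedFiles = ArrayBuffer.empty[Path]

     def trackNewFile(path: Path): Unit = addedFiles += path

     def abortTask(taskContext: TaskAttemptContext): Unit = {
       val conf = taskContext.getConfiguration
       addedFiles.foreach { p =>
         // Best-effort delete; a leftover file would still be invisible to
         // readers because it is never recorded in the sink metadata log.
         p.getFileSystem(conf).delete(p, false)
       }
       addedFiles.clear()
     }
   }
   ```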
    
   In short, what if we provided an option to keep only the last few commits in the sink metadata, so that Structured Streaming itself is not impacted, and changed the various readers not to rely on the metadata log files? Wouldn't that help ensure the reliability of the streaming job?
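
   For illustration, such an option could sit alongside the knobs that already bound metadata log growth (the `retainedCommits` option below is hypothetical, not an actual Spark config; the other two exist today; `spark` is the session from the first sketch):

   ```scala
   // Existing options that bound how the file sink metadata log grows:
   spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10")
   spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", "10m")

   // Hypothetical option in the spirit of this comment: keep only the last N
   // commits in the compacted metadata file (NOT a real Spark option):
   // spark.conf.set("spark.sql.streaming.fileSink.log.retainedCommits", "100")
   ```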

