gaborgsomogyi commented on pull request #28363: URL: https://github.com/apache/spark/pull/28363#issuecomment-625806486
> Yeah I didn't deal with this because there may be some reader queries which still read from an old version of the metadata, which may contain excluded files. (A batch query would read all available files, so there's still a chance for a race condition.)

That's a valid consideration. Cleaning up junk files doesn't necessarily have to belong to this feature; it could be put behind another flag. I've been thinking about this for a long time (though the initial idea was to delete only the generated junk). Of course, this must be done in a separate thread, because directory listing can be pathologically slow in some cases. This could significantly reduce storage costs for users in an automatic way...

> While I'm not sure it's a real problem (as we rely on the last modified time while reading files), I eliminated the case by adding a "commit time" to each entry and applying retention based on commit time.

So I guess the concern is no longer valid. I've played with HDFS, read the docs of the other filesystems, and haven't found any glitches.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
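
The commit-time retention idea discussed above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code from the PR: the names `SinkFileStatus`, `commitTimeMs`, and `applyRetention` are assumptions, and the real metadata log entry and retention config in Spark differ.

```scala
// Hypothetical sketch of commit-time-based retention on sink metadata
// entries. Names and shapes are illustrative only, not the PR's actual code.
final case class SinkFileStatus(path: String, commitTimeMs: Long)

// Keep only entries whose commit time still falls inside the retention
// window; older entries are eligible to be dropped from the compacted log.
def applyRetention(
    entries: Seq[SinkFileStatus],
    nowMs: Long,
    retentionMs: Long): Seq[SinkFileStatus] =
  entries.filter(e => nowMs - e.commitTimeMs <= retentionMs)
```

Basing retention on an explicit commit time recorded in the entry, rather than on the file's last modified time, avoids depending on filesystem timestamp semantics that can vary across HDFS and object stores.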
