HeartSaVioR edited a comment on pull request #28363: URL: https://github.com/apache/spark/pull/28363#issuecomment-625605044
> If we want to do full TTL then a separate GC would be good to delete files matching the 2nd and 3rd bullet points (of course only after they are removed from the metadata).

Yeah, I didn't deal with this because there may be reader queries which still read from an old version of the metadata, which may still contain excluded files. (A batch query would read all available files, so there's still a chance of a race condition.)

> What I see as a potential problem is that FS timestamp may be different from local time (not yet checked how Hadoop handles time).

While I'm not sure it's a real problem (as we rely on the last modified time while reading files), I eliminated the case by adding a "commit time" to each entry and applying retention based on that commit time. So I guess this concern is no longer valid.
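
For illustration, here is a minimal sketch of the commit-time-based retention idea described above. The `FileEntry` case class, its field names, and the `retentionMs` parameter are hypothetical stand-ins, not the actual classes or config touched by this PR; the real sink log entries in Spark carry more fields.

```scala
// Hypothetical sketch of retention based on an explicit commit timestamp
// stored on each metadata entry, instead of the file system's modified time.
case class FileEntry(path: String, size: Long, commitTimeMs: Long)

object RetentionFilter {
  /** Keep only entries whose commit time falls within the retention window. */
  def applyRetention(
      entries: Seq[FileEntry],
      retentionMs: Long,
      nowMs: Long = System.currentTimeMillis()): Seq[FileEntry] = {
    entries.filter(e => nowMs - e.commitTimeMs <= retentionMs)
  }
}
```

Keying retention off a commit time written into the entry avoids depending on the file system clock being in sync with the driver's local time, which is the concern raised in the quoted comment.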
