gaborgsomogyi commented on pull request #28363: URL: https://github.com/apache/spark/pull/28363#issuecomment-625264479
I've read the discussion on https://github.com/apache/spark/pull/24128 and I agree that TTL would be the way to go. I like, for instance, how Kafka handles this situation (even if retention causes some confusion on the Spark user side when retention has deleted data that Spark still wanted to process and couldn't find). I think the metadata must be compacted first (removing file entries whose TTL has expired), but what I'm missing is actually deleting the files. Without this patch there are two types of files:
* Name exists in the metadata file
* Name doesn't exist in the metadata file (it's junk)

With this change a third type is added:
* Name doesn't exist in the metadata file (TTL expired)

If we want to do full TTL, then a separate GC would be good to delete files matching the 2nd and 3rd bullet points (of course, only after they have been removed from the metadata); a rough sketch of the idea follows below. What I see as a potential problem is that the FS timestamp may differ from local time (I haven't yet checked how Hadoop handles time).
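To make the two-step idea concrete, here is a minimal Scala sketch of what such a compaction + GC pass could look like. Everything here is hypothetical: `MetadataEntry`, `TtlFileSourceGc`, and the method names are illustrative and not part of Spark's actual file source metadata code; only the Hadoop `FileSystem` calls (`listStatus`, `getModificationTime`, `delete`) are real APIs.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical representation of a metadata log entry: a file path plus the
// timestamp it was recorded with.
case class MetadataEntry(path: String, timestampMs: Long)

object TtlFileSourceGc {

  // Step 1: compact the metadata by splitting entries into those still within
  // the TTL (kept) and those whose TTL has expired (dropped).
  def compact(
      entries: Seq[MetadataEntry],
      ttlMs: Long,
      nowMs: Long): (Seq[MetadataEntry], Seq[MetadataEntry]) =
    entries.partition(e => nowMs - e.timestampMs <= ttlMs)

  // Step 2: a separate GC pass that deletes files no longer referenced by the
  // compacted metadata (the "junk" and "TTL expired" bullet points above).
  // Note: this relies on the file system's modification time, which may
  // differ from the driver's local clock -- exactly the potential problem
  // mentioned in the comment.
  def gc(fs: FileSystem, dir: Path, live: Set[String], ttlMs: Long, nowMs: Long): Unit = {
    fs.listStatus(dir).foreach { status =>
      val path = status.getPath.toString
      val expiredByFsTime = nowMs - status.getModificationTime > ttlMs
      if (!live.contains(path) && expiredByFsTime) {
        fs.delete(status.getPath, /* recursive = */ false)
      }
    }
  }
}
```

The ordering matters: files should only be deleted after their entries have been removed from the metadata, otherwise a query could still try to read a file that the GC already deleted (the same confusion Kafka retention can cause).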
