gaborgsomogyi commented on pull request #28363: URL: https://github.com/apache/spark/pull/28363#issuecomment-625264479
I've read the discussion on https://github.com/apache/spark/pull/24128 and I agree that TTL would be the way to go. I like, for instance, how Kafka handles this situation (even if retention causes some confusion on the Spark user side when retention has deleted data that Spark still wanted to process and couldn't find). I think the metadata must be compacted first (removing file entries whose TTL has expired), but what I'm missing is actually deleting the files. Without this patch there are two types of files:
* Name exists in the metadata file
* Name doesn't exist in the metadata file (it's junk)

With this change a third type is added:
* Name doesn't exist in the metadata file (TTL expired)

If we want to do full TTL, then a separate GC would be good to delete files matching the 2nd and 3rd bullet points (of course, only after they have been removed from the metadata); a rough sketch of the idea follows below. What I see as a potential problem is that the FS timestamp may differ from local time (I haven't yet checked how Hadoop handles time).
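To make the two-step idea concrete, here is a minimal Scala sketch of what such a compaction + GC pass could look like. Everything here is hypothetical: `MetadataEntry`, `TtlFileSourceGc`, and the method names are illustrative and not part of Spark's actual file source metadata code; only the Hadoop `FileSystem` calls (`listStatus`, `getModificationTime`, `delete`) are real APIs.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical representation of a metadata log entry: a file path plus the
// timestamp it was recorded with.
case class MetadataEntry(path: String, timestampMs: Long)

object TtlFileSourceGc {

  // Step 1: compact the metadata by splitting entries into those still within
  // the TTL (kept) and those whose TTL has expired (dropped).
  def compact(
      entries: Seq[MetadataEntry],
      ttlMs: Long,
      nowMs: Long): (Seq[MetadataEntry], Seq[MetadataEntry]) =
    entries.partition(e => nowMs - e.timestampMs <= ttlMs)

  // Step 2: a separate GC pass that deletes files no longer referenced by the
  // compacted metadata (the "junk" and "TTL expired" bullet points above).
  // Note: this relies on the file system's modification time, which may
  // differ from the driver's local clock -- exactly the potential problem
  // mentioned in the comment.
  def gc(fs: FileSystem, dir: Path, live: Set[String], ttlMs: Long, nowMs: Long): Unit = {
    fs.listStatus(dir).foreach { status =>
      val path = status.getPath.toString
      val expiredByFsTime = nowMs - status.getModificationTime > ttlMs
      if (!live.contains(path) && expiredByFsTime) {
        fs.delete(status.getPath, /* recursive = */ false)
      }
    }
  }
}
```

The ordering matters: files should only be deleted after their entries have been removed from the metadata, otherwise a query could still try to read a file that the GC already deleted (the same confusion Kafka retention can cause).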
