gaborgsomogyi commented on pull request #28363: URL: https://github.com/apache/spark/pull/28363#issuecomment-625806486
> Yeah I didn't deal with this because there may be some reader queries which still read from an old version of the metadata, which may contain excluded files. (A batch query would read all available files, so there's still a chance for a race condition.)

That's a valid consideration. Cleaning up junk files doesn't necessarily have to belong to this feature; it could be put behind another flag. I've been thinking about this for a long time (though the initial idea was to delete only the generated junk). Of course, this must be done in a separate thread, because directory listing can be pathologically slow in some cases. This could significantly reduce storage costs for users in an automatic way...

> While I'm not sure it's a real problem (as we rely on the last modified time while reading files), I eliminated the case by adding a "commit time" to each entry and applying retention based on commit time.

So I guess the concern is no longer valid. I've played with HDFS, read the docs of the other filesystems, and haven't found any glitches.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
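
The commit-time retention idea discussed above can be sketched roughly as follows. This is a hypothetical illustration, not the actual code from the PR: the names `SinkFileStatus`, `commitTimeMs`, and `applyRetention` are assumptions, and the real metadata log entry and retention config in Spark differ.

```scala
// Hypothetical sketch of commit-time-based retention on sink metadata
// entries. Names and shapes are illustrative only, not the PR's actual code.
final case class SinkFileStatus(path: String, commitTimeMs: Long)

// Keep only entries whose commit time still falls inside the retention
// window; older entries are eligible to be dropped from the compacted log.
def applyRetention(
    entries: Seq[SinkFileStatus],
    nowMs: Long,
    retentionMs: Long): Seq[SinkFileStatus] =
  entries.filter(e => nowMs - e.commitTimeMs <= retentionMs)
```

Basing retention on an explicit commit time recorded in the entry, rather than on the file's last modified time, avoids depending on filesystem timestamp semantics that can vary across HDFS and object stores.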
