HeartSaVioR commented on pull request #28422: URL: https://github.com/apache/spark/pull/28422#issuecomment-641097576
I agree that adding an option so similar to `maxFileAge` feels tricky. As you may have already indicated, there are cases where `maxFileAge` has to be ignored, which means Spark can never drop entries from the metadata (e.g. when `latestFirst` is true and `maxFilesPerTrigger` is set). Given that all of these options can be changed between runs, I wasn't sure it would be safe to drop entries based on the current set of options and the state of the entries; there looked to be an edge case where input files could be processed more than once.

I also found it less intuitive to reason about how the max age is applied: it is measured against the timestamp of the latest file Spark has discovered, not the current system time. (But that might be only me.)

The new option keeps the behavior consistent regardless of these options. It simply acts as a "hard" limit: in every case, Spark won't process files older than the threshold. (Think of those files as having already been removed by the retention policy, even if they are not physically deleted.) It applies to both forward and backward reads, no matter how many files Spark reads in a batch.

(Personally, I think `maxFileAge` itself should work this way, and then we wouldn't have this confusion.)
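To make the distinction concrete, here is a minimal Scala sketch of the two semantics as I understand them. The names (`FileEntry`, `filterByMaxFileAge`, `filterByHardCutoff`) are purely illustrative and are not taken from the actual `FileStreamSource` code:

```scala
// Hypothetical representation of a discovered input file.
case class FileEntry(path: String, timestampMs: Long)

object AgeFilterSketch {
  // maxFileAge today (as I read it): the threshold is anchored to the newest
  // file Spark has discovered, not to the system clock.
  def filterByMaxFileAge(files: Seq[FileEntry], maxFileAgeMs: Long): Seq[FileEntry] = {
    if (files.isEmpty) {
      files
    } else {
      val latestTs = files.map(_.timestampMs).max
      files.filter(_.timestampMs >= latestTs - maxFileAgeMs)
    }
  }

  // The proposed "hard" limit: anything older than the cutoff is treated as if
  // the retention policy had already removed it, regardless of latestFirst,
  // maxFilesPerTrigger, or the read direction.
  def filterByHardCutoff(
      files: Seq[FileEntry],
      cutoffMs: Long,
      nowMs: Long = System.currentTimeMillis()): Seq[FileEntry] = {
    files.filter(_.timestampMs >= nowMs - cutoffMs)
  }
}
```

With the hard cutoff, the set of eligible files no longer depends on `latestFirst` or `maxFilesPerTrigger`, which is the consistency I was trying to describe above.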
