gaborgsomogyi commented on pull request #28422: URL: https://github.com/apache/spark/pull/28422#issuecomment-643271165
I've analyzed this further. I share the opinion about `maxFileAge`: the way it's currently implemented is unintuitive. I think it should work like this:

* `maxFileAge` should behave like `inputRetention`. Retention is normally based on the current timestamp; we don't need to look far, Kafka and similar components do exactly that.
* The current feature should depend on `maxFileAge`.

If the user wants to operate a query with `latestFirst` over the long term, then I see these options (a configuration sketch follows this list):

* The user sets `maxFileAge` properly => no file loss, just some fluctuation in the number of not-yet-processed files.
* The user doesn't set `maxFileAge` properly but the cluster is sized properly => configuration issue, because with a proper value all files must be processed within `maxFileAge`.
* The user doesn't set `maxFileAge` properly and the cluster is sized badly => sizing and configuration issue. Cluster compute power must be increased to leave room for the old, not-yet-processed files. As in the previous case, choosing an appropriate `maxFileAge` is important.

The last point can be problematic and can end up in data loss, but this is exactly the same as processing data from Kafka: if retention fires, the data just disappears without any notification. This situation is actually better, though, because if the query is not able to catch up, it can be restarted with a bigger `maxFileAge` and a larger cluster, allowing the query to catch up properly.
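As a minimal sketch of what "setting `maxFileAge` properly" looks like in practice (not part of this PR; paths and app name are hypothetical, and the retention-relative-to-current-timestamp semantics discussed above are the proposed behaviour, not necessarily what released versions do):

```scala
import org.apache.spark.sql.SparkSession

object LatestFirstSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("latestFirst-maxFileAge-sketch") // hypothetical app name
      .getOrCreate()

    // `latestFirst` and `maxFileAge` are existing file-source options;
    // here the user explicitly bounds how old a file may be before it is
    // intentionally skipped, instead of relying on the default.
    val input = spark.readStream
      .format("text")
      .option("latestFirst", "true") // process the newest files first
      .option("maxFileAge", "7d")    // files older than this are not processed
      .load("/data/incoming")        // hypothetical input directory

    val query = input.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/latest-first") // hypothetical path
      .start()

    query.awaitTermination()
  }
}
```

With a properly sized cluster, such a query should process everything within the `maxFileAge` window; if it falls behind, restarting it with a larger `maxFileAge` (and more compute) lets it catch up, as described above.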
