gaborgsomogyi commented on pull request #28422:
URL: https://github.com/apache/spark/pull/28422#issuecomment-643271165


   I've analyzed this further. I have the same opinion about `maxFileAge`, 
namely that its current behaviour is unintuitive. I think it should work like this:
   * `maxFileAge` should behave like an `inputRetention` setting: retention is 
normally anchored to the current timestamp. We don't have to look far, Kafka and 
similar components do exactly that (see the sketch after this list).
   * The current feature should depend on `maxFileAge`.
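   
   To illustrate (this is just my own sketch, not code from this PR), a retention 
check anchored to the current timestamp would look roughly like this; the helper 
name and parameters are made up for the example:
   ```scala
   // Sketch only: a file survives retention if it is newer than
   // "current time - maxFileAge", the same way Kafka anchors retention
   // to the current timestamp instead of to the latest seen entry.
   def isWithinRetention(
       fileModTimeMs: Long,
       maxFileAgeMs: Long,
       nowMs: Long = System.currentTimeMillis()): Boolean = {
     fileModTimeMs >= nowMs - maxFileAgeMs
   }
   ```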
   
   If the user wants to operate a query with `latestFirst` in the long term then 
I see these options:
   * User sets `maxFileAge` properly => no file loss, just some fluctuation in 
the number of not-yet-processed files (see the snippet after this list).
   * User doesn't set `maxFileAge` properly but the cluster is sized properly => 
configuration issue, because with a proper value all the files must be processed 
within `maxFileAge`.
   * User doesn't set `maxFileAge` properly and the cluster is sized badly => 
sizing and configuration issue. The cluster's computation power must be increased 
to leave room for the old, not yet processed files. As in the previous case, 
choosing an appropriate `maxFileAge` is important.
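   
   For reference (my own example, not taken from this PR), this is roughly how a 
user would set these options on the file stream source; the format, schema and 
path are just placeholders:
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().appName("latest-first-example").getOrCreate()
   
   // Newest files are picked up first; files older than maxFileAge are ignored
   // (subject to the latestFirst interaction discussed in this PR), so maxFileAge
   // must be chosen so the cluster can process everything before it falls out
   // of the window.
   val input = spark.readStream
     .format("json")
     .schema("id LONG, ts TIMESTAMP")   // placeholder schema
     .option("latestFirst", "true")
     .option("maxFileAge", "7d")
     .load("/data/incoming")            // placeholder path
   ```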
   
   The last point can be problematic and can end up in data loss, but this is 
exactly the same as when processing data from Kafka. If retention fires then the 
data just disappears without any notification. This situation is better though, 
because if the query is not able to catch up then it can be restarted with a 
bigger `maxFileAge` and a bigger cluster, allowing it to catch up properly.
   

