HeartSaVioR commented on pull request #28422: URL: https://github.com/apache/spark/pull/28422#issuecomment-641097576
I agree that adding an option so similar to `maxFileAge` feels tricky. As you may have already indicated, there are cases where `maxFileAge` has to be ignored, which means Spark can never drop entries from the metadata (e.g. when `latestFirst` is true and `maxFilesPerTrigger` is set). Given that all of these options can be changed between runs, I wasn't sure it would be safe to drop entries based on the current set of options and the state of the entries; there looked to be an edge case where input files could be processed more than once.

I also found it less intuitive to reason about how the max age is applied: it is measured against the timestamp of the latest file Spark has discovered, not the current system time. (But that might be only me.)

The new option keeps the behavior consistent regardless of these options. It simply acts as a "hard" limit: in every case, Spark won't process files older than the threshold. (Think of those files as having already been removed by the retention policy, even if they are not physically deleted.) It applies to both forward and backward reads, no matter how many files Spark reads in a batch.

(Personally, I think `maxFileAge` itself should work this way, and then we wouldn't have this confusion.)
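To make the distinction concrete, here is a minimal Scala sketch of the two semantics as I understand them. The names (`FileEntry`, `filterByMaxFileAge`, `filterByHardCutoff`) are purely illustrative and are not taken from the actual `FileStreamSource` code:

```scala
// Hypothetical representation of a discovered input file.
case class FileEntry(path: String, timestampMs: Long)

object AgeFilterSketch {
  // maxFileAge today (as I read it): the threshold is anchored to the newest
  // file Spark has discovered, not to the system clock.
  def filterByMaxFileAge(files: Seq[FileEntry], maxFileAgeMs: Long): Seq[FileEntry] = {
    if (files.isEmpty) {
      files
    } else {
      val latestTs = files.map(_.timestampMs).max
      files.filter(_.timestampMs >= latestTs - maxFileAgeMs)
    }
  }

  // The proposed "hard" limit: anything older than the cutoff is treated as if
  // the retention policy had already removed it, regardless of latestFirst,
  // maxFilesPerTrigger, or the read direction.
  def filterByHardCutoff(
      files: Seq[FileEntry],
      cutoffMs: Long,
      nowMs: Long = System.currentTimeMillis()): Seq[FileEntry] = {
    files.filter(_.timestampMs >= nowMs - cutoffMs)
  }
}
```

With the hard cutoff, the set of eligible files no longer depends on `latestFirst` or `maxFilesPerTrigger`, which is the consistency I was trying to describe above.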
