[ 
https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895110#comment-15895110
 ] 

Apache Spark commented on SPARK-19813:
--------------------------------------

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/17153

> maxFilesPerTrigger combo latestFirst may miss old files in combination with 
> maxFileAge in FileStreamSource
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19813
>                 URL: https://issues.apache.org/jira/browse/SPARK-19813
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Burak Yavuz
>            Assignee: Burak Yavuz
>
> There is a file stream source option called maxFileAge which limits how old 
> the files can be, relative to the latest file that has been seen. This is used 
> to limit the files that need to be remembered as "processed". Files older 
> than the latest processed file by more than maxFileAge are ignored. The 
> default value is 7 days.
> This causes a problem when both of the following hold:
>  - latestFirst = true
>  - maxFilesPerTrigger < total files to be processed.
> Here is what happens in each combination:
>  1) latestFirst = false - Since files are processed in order, there won't be 
> any unprocessed file older than the latest processed file. All files will be 
> processed.
>  2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
> thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger 
> is not set, then all old files get processed in the first batch, and so no 
> file is left behind.
>  3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
> processes the latest X files. That sets the threshold to (latest file - 
> maxFileAge), so files older than this threshold will never be considered 
> for processing.
> The bug is with case 3.
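
The three cases in the description can be reproduced with a small toy model. The sketch below is not Spark's FileStreamSource code. It assumes that files are identified only by a unique modification time in whole days, that the age threshold starts applying after the first batch, and that the threshold is (latest processed file - maxFileAge). The function name `simulate` and its parameters are hypothetical and exist only for this illustration.

```python
from typing import List, Optional, Tuple


def simulate(file_days: List[int],
             latest_first: bool,
             max_files_per_trigger: Optional[int],
             max_file_age_days: int = 7) -> Tuple[List[int], List[int]]:
    """Toy model of the file source's batching (an assumption, not Spark code).

    file_days holds unique file modification times, in days.
    Returns (files in processing order, files that are never processed).
    """
    seen = set()
    latest_seen = None  # newest file processed so far; None before batch 1
    processed = []

    while True:
        # The threshold needs one completed batch before it takes effect.
        threshold = None if latest_seen is None else latest_seen - max_file_age_days
        candidates = [
            d for d in file_days
            if d not in seen and (threshold is None or d >= threshold)
        ]
        if not candidates:
            break

        # latest_first = True sorts newest to oldest, otherwise oldest to newest.
        candidates.sort(reverse=latest_first)
        if max_files_per_trigger is None:
            batch = candidates
        else:
            batch = candidates[:max_files_per_trigger]

        processed.extend(batch)
        seen.update(batch)
        newest_in_batch = max(batch)
        latest_seen = newest_in_batch if latest_seen is None else max(latest_seen, newest_in_batch)

    missed = sorted(d for d in file_days if d not in seen)
    return processed, missed
```

Under these assumptions, with files at days 0, 1, 10 and 20, `simulate(files, False, 2)` and `simulate(files, True, None)` both process every file, while `simulate(files, True, 2)` processes only days 20 and 10 and leaves days 0 and 1 unprocessed, which matches case 3.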



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

