[
https://issues.apache.org/jira/browse/SPARK-19813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Burak Yavuz resolved SPARK-19813.
---------------------------------
Resolution: Fixed
Fix Version/s: 2.2.0, 2.1.1
> maxFilesPerTrigger combo latestFirst may miss old files in combination with
> maxFileAge in FileStreamSource
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-19813
> URL: https://issues.apache.org/jira/browse/SPARK-19813
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.1.0
> Reporter: Burak Yavuz
> Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>
> There is a file stream source option called maxFileAge which limits how old
> the files can be, relative to the latest file that has been seen. This is used
> to limit the files that need to be remembered as "processed". Files older
> than the latest processed file by more than maxFileAge are ignored. This value
> is 7 days by default.
> This causes a problem when both of the following hold:
> - latestFirst = true
> - maxFilesPerTrigger < total files to be processed
> Here is what happens in all combinations:
> 1) latestFirst = false - Since files are processed in order, there won't be
> any unprocessed file older than the latest processed file. All files will be
> processed.
> 2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge
> thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is
> not set, then all old files get processed in the first batch, and so no file
> is left behind.
> 3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch
> processes the latest X files. That sets the threshold to (timestamp of the
> latest file - maxFileAge), so files older than this threshold will never be
> considered for processing.
> The bug is with case 3.
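
The three cases above can be reproduced with a small simulation. This is only a simplified model written for illustration, not the FileStreamSource code: the function name simulate, its parameters, and the 20-file data set are all hypothetical, and the age check here keeps any file whose timestamp is at or above (latest processed timestamp - maxFileAge).

```python
DAY_MS = 24 * 60 * 60 * 1000


def simulate(timestamps, latest_first, max_files_per_trigger=None,
             max_file_age_ms=7 * DAY_MS):
    """Return the set of file timestamps that eventually get processed.

    Hypothetical, simplified model of the behaviour described in the issue.
    """
    processed = set()
    latest_seen = None  # newest timestamp processed so far
    while True:
        # The age threshold only exists once at least one batch has run.
        threshold = None if latest_seen is None else latest_seen - max_file_age_ms
        candidates = [
            t for t in timestamps
            if t not in processed and (threshold is None or t >= threshold)
        ]
        if not candidates:
            return processed
        # latestFirst = true sorts newest first, otherwise oldest first.
        candidates.sort(reverse=latest_first)
        if max_files_per_trigger is None:
            batch = candidates
        else:
            batch = candidates[:max_files_per_trigger]
        processed.update(batch)
        newest_in_batch = max(batch)
        if latest_seen is None or newest_in_batch > latest_seen:
            latest_seen = newest_in_batch


# One file per day for 20 days (assumed data, for illustration only).
files = [day * DAY_MS for day in range(20)]

case1 = simulate(files, latest_first=False, max_files_per_trigger=5)
case2 = simulate(files, latest_first=True, max_files_per_trigger=None)
case3 = simulate(files, latest_first=True, max_files_per_trigger=5)

print(len(case1))  # 20: all files processed
print(len(case2))  # 20: all files processed in the first batch
print(len(case3))  # 8: the 12 oldest files are never considered
```

In case 3 the first batch takes days 15 to 19, which puts the threshold at day 12; days 12 to 14 are picked up in the next batch, and days 0 to 11 fall below the threshold for good.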