Burak Yavuz created SPARK-19813:
-----------------------------------
Summary: maxFilesPerTrigger combo latestFirst may miss old files
in combination with maxFileAge in FileStreamSource
Key: SPARK-19813
URL: https://issues.apache.org/jira/browse/SPARK-19813
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Assignee: Burak Yavuz
There is a file stream source option called maxFileAge, which limits how old
files can be relative to the latest file that has been seen. It is used to
limit the set of files that need to be remembered as "processed". Files older
than the latest processed file by more than maxFileAge are ignored. The default
value is 7 days.
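As a rough illustration of the age check described above, here is a minimal
sketch in plain Python. It is a simplified model, not Spark's actual code: the
function name is_too_old and the millisecond timestamps are assumptions made
for the example.

```python
from datetime import timedelta

DAY_MS = 24 * 60 * 60 * 1000
# Default maxFileAge of 7 days, expressed in milliseconds.
MAX_FILE_AGE_MS = int(timedelta(days=7).total_seconds() * 1000)


def is_too_old(file_ts_ms, latest_seen_ts_ms, max_file_age_ms=MAX_FILE_AGE_MS):
    """Hypothetical helper: True if the file falls below the age threshold."""
    threshold = latest_seen_ts_ms - max_file_age_ms
    return file_ts_ms < threshold


latest = 100 * DAY_MS
recent = is_too_old(latest - 3 * DAY_MS, latest)   # 3 days behind the latest file
stale = is_too_old(latest - 10 * DAY_MS, latest)   # 10 days behind the latest file
```

Under this model a file 3 days behind the latest one is kept, while a file 10
days behind it is ignored.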
This causes a problem when both of the following hold:
- latestFirst = true
- maxFilesPerTrigger < total files to be processed
Here is what happens in each combination:
1) latestFirst = false - Since files are processed in order, there won't be any
unprocessed file older than the latest processed file. All files will be
processed.
2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge
thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is
not set, then all old files get processed in the first batch, and so no file is
left behind.
3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch
processes the latest X files. That sets the threshold to (timestamp of the
latest file - maxFileAge), so files older than this threshold will never be
considered for processing.
The bug is with case 3.
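
The three cases can be reproduced with a toy simulation. This is a sketch under
stated assumptions, not the FileStreamSource implementation: files are
represented only by unique integer timestamps (in days), the simulate function
is hypothetical, and the threshold is assumed to take effect only after the
first batch, as described in case 2.

```python
def simulate(file_timestamps, latest_first, max_files_per_trigger, max_file_age):
    """Toy model of the described behaviour; returns the processed timestamps."""
    processed = []
    seen = set()          # timestamps are assumed unique, so they identify files
    latest_seen = None    # the threshold is not initialized before the first batch
    while True:
        threshold = None if latest_seen is None else latest_seen - max_file_age
        candidates = [
            t for t in file_timestamps
            if t not in seen and (threshold is None or t >= threshold)
        ]
        if not candidates:
            break
        candidates.sort(reverse=latest_first)
        if max_files_per_trigger is None:
            batch = candidates
        else:
            batch = candidates[:max_files_per_trigger]
        processed.extend(batch)
        seen.update(batch)
        newest = max(batch)
        latest_seen = newest if latest_seen is None else max(latest_seen, newest)
    return sorted(processed)


files = list(range(20))  # one file per day, days 0..19

case1 = simulate(files, latest_first=False, max_files_per_trigger=5, max_file_age=7)
case2 = simulate(files, latest_first=True, max_files_per_trigger=None, max_file_age=7)
case3 = simulate(files, latest_first=True, max_files_per_trigger=5, max_file_age=7)
missed = sorted(set(files) - set(case3))
```

In this model, cases 1 and 2 process all 20 files. In case 3 the first batch
takes days 15 to 19, which puts the threshold at day 12, so only days 12 to 19
are ever processed and days 0 to 11 are left in missed.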