GitHub user brkyvz opened a pull request:
https://github.com/apache/spark/pull/17153
[SPARK-19813] maxFilesPerTrigger combo latestFirst may miss old files in
combination with maxFileAge in FileStreamSource
## What changes were proposed in this pull request?
**The Problem**
There is a file stream source option called maxFileAge which limits how old
the files can be, relative to the latest file that has been seen. This is used to
limit the files that need to be remembered as "processed". Files older than the
latest processed file by more than maxFileAge are ignored. This value is 7 days
by default.
This causes a problem when both of the following hold:
- latestFirst = true
- maxFilesPerTrigger < total files to be processed
Here is what happens in all combinations:
1) latestFirst = false - Since files are processed in order, there won't be
any unprocessed file older than the latest processed file. All files will be
processed.
2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge
thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is
not set, then all old files get processed in the first batch, and so no file is
left behind.
3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch
processes the latest X files. That sets the threshold to (timestamp of the
latest file - maxFileAge), so files older than this threshold will never be
considered for processing.
The bug is with case 3.
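The three cases above can be reproduced with a small, self-contained Python model. This is only a sketch of the thresholding behavior as described here, not the actual `FileStreamSource` code: the function `simulate`, its parameters, and the 30-file data set are hypothetical, and the model assumes the threshold is updated once per batch from the newest file seen so far.

```python
from typing import List, Optional

DAY_MS = 24 * 60 * 60 * 1000


def simulate(
    timestamps: List[int],
    max_file_age_ms: int,
    latest_first: bool,
    max_files_per_trigger: Optional[int] = None,
) -> List[List[int]]:
    """Hypothetical model: return the files (by timestamp) picked in each batch.

    Timestamps double as file identifiers, so they must be unique.
    """
    seen = set()
    latest_seen = 0
    threshold = 0  # nothing is filtered until the first batch has run
    batches = []
    while True:
        # Only unseen files that are not older than the threshold are eligible.
        candidates = [t for t in timestamps if t not in seen and t >= threshold]
        candidates.sort(reverse=latest_first)
        if max_files_per_trigger is not None:
            candidates = candidates[:max_files_per_trigger]
        if not candidates:
            break
        seen.update(candidates)
        # The threshold follows the newest file seen so far.
        latest_seen = max(latest_seen, max(candidates))
        threshold = latest_seen - max_file_age_ms
        batches.append(candidates)
    return batches


# 30 files, one per day; maxFileAge modeled as 7 days.
files = [day * DAY_MS for day in range(1, 31)]

case1 = simulate(files, 7 * DAY_MS, latest_first=False, max_files_per_trigger=10)
case2 = simulate(files, 7 * DAY_MS, latest_first=True)
case3 = simulate(files, 7 * DAY_MS, latest_first=True, max_files_per_trigger=10)

for name, batches in [("case 1", case1), ("case 2", case2), ("case 3", case3)]:
    print(name, "processed", sum(len(b) for b in batches), "of", len(files))
```

In this model, cases 1 and 2 process all 30 files, while case 3 processes only the 10 newest: after the first batch the threshold sits at day 23, so the 20 older files are never eligible again.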
**The Solution**
Ignore `maxFileAge` when both `maxFilesPerTrigger` and `latestFirst` are
set.
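One way to picture this rule is as a choice of the effective age limit at source construction time. The Python sketch below is an illustration under assumptions, not the code in this patch: the helper `effective_max_file_age_ms` and the sentinel `NO_AGE_LIMIT` are hypothetical names, and "ignore" is modeled as substituting a limit so large that no file can ever fall below the threshold.

```python
import sys
from typing import Optional

DAY_MS = 24 * 60 * 60 * 1000

# Hypothetical sentinel: an age limit so large that no file is ever "too old".
NO_AGE_LIMIT = sys.maxsize


def effective_max_file_age_ms(
    max_file_age_ms: int,
    latest_first: bool,
    max_files_per_trigger: Optional[int],
) -> int:
    """Return the age limit to apply, ignoring maxFileAge in the buggy combination."""
    if latest_first and max_files_per_trigger is not None:
        # Both options are set: old files must stay eligible for later batches.
        return NO_AGE_LIMIT
    return max_file_age_ms


# Only the combination from case 3 disables the age limit.
print(effective_max_file_age_ms(7 * DAY_MS, True, 10) == NO_AGE_LIMIT)
print(effective_max_file_age_ms(7 * DAY_MS, True, None) == 7 * DAY_MS)
print(effective_max_file_age_ms(7 * DAY_MS, False, 10) == 7 * DAY_MS)
```

The trade-off in such a design is that, with the age limit disabled, the set of files remembered as "processed" is no longer bounded by maxFileAge for that combination of options.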
## How was this patch tested?
Regression test in `FileStreamSourceSuite`
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/brkyvz/spark maxFileAge
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17153.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17153
----
commit b2e7365aaa487143f3bb8fe9d07dd3b8651e176f
Author: Burak Yavuz <[email protected]>
Date: 2017-03-03T22:06:34Z
ready for review
----