GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/17153

    [SPARK-19813] maxFilesPerTrigger combo latestFirst may miss old files in 
combination with maxFileAge in FileStreamSource

    ## What changes were proposed in this pull request?
    
    **The Problem**
    There is a file stream source option called maxFileAge which limits how old 
the files can be, relative to the latest file that has been seen. This is used to 
limit the files that need to be remembered as "processed". Files older than the 
latest processed file by more than maxFileAge are ignored. This value is 7 days 
by default.
    This causes a problem when both of the following hold:
    latestFirst = true
    maxFilesPerTrigger < total files to be processed.
    Here is what happens in all combinations:
    1) latestFirst = false - Since files are processed in order, there won't be 
any unprocessed file older than the latest processed file. All files will be 
processed.
    2) latestFirst = true AND maxFilesPerTrigger is not set - The maxFileAge 
thresholding mechanism takes one batch to initialize. If maxFilesPerTrigger is 
not set, then all old files get processed in the first batch, and so no file is 
left behind.
    3) latestFirst = true AND maxFilesPerTrigger is set to X - The first batch 
processes the latest X files. That sets the threshold to (timestamp of the 
latest file - maxFileAge), so files older than this threshold will never be 
considered for processing.
    The bug is with case 3.
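
    The three cases above can be reproduced with a small toy model. This is 
only a sketch of the thresholding behaviour as described here, not the actual 
FileStreamSource code; the names `File` and `simulate_batches` are hypothetical, 
and timestamps and maxFileAge are expressed in days for readability.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class File:
    path: str
    timestamp: int  # modification time, in days for this toy example


def simulate_batches(
    files: List[File],
    latest_first: bool,
    max_files_per_trigger: Optional[int],
    max_file_age: int,
) -> List[List[str]]:
    """Toy model (an assumption, not Spark's code): returns the paths picked per batch."""
    seen = set()
    latest_ts = None   # timestamp of the latest file processed so far
    threshold = None   # files older than this are ignored once it is initialized
    batches = []
    while True:
        # Only unseen files at or above the age threshold are candidates.
        candidates = [
            f for f in files
            if f.path not in seen and (threshold is None or f.timestamp >= threshold)
        ]
        candidates.sort(key=lambda f: f.timestamp, reverse=latest_first)
        if max_files_per_trigger is not None:
            candidates = candidates[:max_files_per_trigger]
        if not candidates:
            return batches
        for f in candidates:
            seen.add(f.path)
            latest_ts = f.timestamp if latest_ts is None else max(latest_ts, f.timestamp)
        batches.append([f.path for f in candidates])
        # The threshold is (re)computed only after a batch has been processed.
        threshold = latest_ts - max_file_age


files = [File("a", 0), File("b", 5), File("c", 10), File("d", 20)]

case1 = simulate_batches(files, latest_first=False, max_files_per_trigger=1, max_file_age=7)
case2 = simulate_batches(files, latest_first=True, max_files_per_trigger=None, max_file_age=7)
case3 = simulate_batches(files, latest_first=True, max_files_per_trigger=1, max_file_age=7)
```

    In this model, cases 1 and 2 pick up all four files, while case 3 processes 
only the newest file: after the first batch the threshold becomes 20 - 7 = 13, 
so the three older files are never considered again.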
    
    **The Solution**
    
    Ignore `maxFileAge` when both `maxFilesPerTrigger` and `latestFirst` are 
set.
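
    As a rough illustration of that rule, the option resolution could look like 
the sketch below. The function name `effective_max_file_age_ms` and the use of 
`None` to mean "no age limit" are assumptions made for this example; the patch 
itself may represent the disabled limit differently.

```python
from typing import Optional


def effective_max_file_age_ms(
    max_file_age_ms: int,
    latest_first: bool,
    max_files_per_trigger: Optional[int],
) -> Optional[int]:
    """Hypothetical helper: returns the age limit to apply, or None for no limit."""
    # When newest files are read first AND batches are capped, an age threshold
    # derived from the first batch would hide the older backlog, so it is ignored.
    if latest_first and max_files_per_trigger is not None:
        return None
    return max_file_age_ms
```

    With this rule, only the combination from case 3 disables the age limit; 
the other two combinations keep the configured maxFileAge.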
    
    
    ## How was this patch tested?
    
    Regression test in `FileStreamSourceSuite`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark maxFileAge

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17153.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17153
    
----
commit b2e7365aaa487143f3bb8fe9d07dd3b8651e176f
Author: Burak Yavuz <[email protected]>
Date:   2017-03-03T22:06:34Z

    ready for review

----

