GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/3419

    [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files 
from being processed multiple times

    Because of a corner case, a file already selected for batch t can be 
considered again for batch t+2. This refactoring fixes that by remembering all 
the files selected in the last minute, so the corner case cannot arise. It 
also uses the SparkContext's Hadoop configuration to access the file system 
API when listing directories.
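
    For context, here is a minimal Scala sketch of the two changes described 
above (the names and signatures are hypothetical, not the actual 
FileInputDStream code):

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.Path
        import scala.collection.mutable

        // Remember the files selected for recent batches and filter new
        // candidates against them, so a file picked for batch t cannot be
        // selected again for batch t+2.
        class RecentFileTracker(rememberDurationMs: Long) {
          // batch time (ms) -> files selected for that batch
          private val selected = new mutable.HashMap[Long, Seq[String]]

          def filterNew(batchTimeMs: Long, candidates: Seq[String]): Seq[String] = {
            val seen = selected.values.flatten.toSet
            val fresh = candidates.filterNot(seen.contains)
            selected(batchTimeMs) = fresh
            // Forget batches older than the remember window (e.g. 1 minute).
            val cutoff = batchTimeMs - rememberDurationMs
            selected.keys.filter(_ <= cutoff).toList.foreach(selected.remove)
            fresh
          }
        }

        // Directory listing through the Hadoop FileSystem API; the caller
        // would pass the SparkContext's hadoopConfiguration rather than
        // constructing a fresh Configuration.
        def listFiles(dir: String, conf: Configuration): Seq[String] = {
          val path = new Path(dir)
          val fs = path.getFileSystem(conf)
          fs.listStatus(path).map(_.getPath.toString).toSeq
        }

    With a one-minute remember window, a file selected for batch t is still 
in the remembered set when batch t+2 runs, so it is filtered out instead of 
being processed a second time.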
    
    @pwendell Please take a look. I have not yet run the long-running 
integration tests, so I cannot say for sure whether this has indeed solved 
the issue. You could do a first pass on it in the meantime.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark filestream-fix2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3419
    
----
commit 9dbd40abe4627fd43f664630512540d62bb785e1
Author: Tathagata Das <[email protected]>
Date:   2014-11-23T01:24:41Z

    Refactored FileInputDStream to remember last few batches.

commit eaef4e1ac11929f9aef7f57b8f52d26e3c048901
Author: Tathagata Das <[email protected]>
Date:   2014-11-23T01:26:27Z

    Fixed SPARK-4519

commit 203bbc72d6cc96c67fbf86f4d137b3e7fa8afd30
Author: Tathagata Das <[email protected]>
Date:   2014-11-23T01:28:23Z

    Un-ignore tests.

----

