GitHub user tdas opened a pull request:
https://github.com/apache/spark/pull/3419
[SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files
from being processed multiple times
Because of a corner case, a file already selected for batch t can be
considered again for batch t+2. This refactoring fixes that by remembering all
the files selected in the last 1 minute, so that this corner case does not
arise. It also uses the Spark context's Hadoop configuration to access the
file system API for listing directories.
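The core idea described above, remembering recently selected files so a file chosen for batch t is never re-selected for a later batch, can be sketched roughly as follows. This is an illustrative sketch, not Spark's actual FileInputDStream code; the class and method names (`RecentlySelectedFiles`, `select_new_files`) and the dictionary-based bookkeeping are hypothetical.

```python
import time


class RecentlySelectedFiles:
    """Remembers files selected within a recent window so that a file
    already chosen for one batch is not chosen again for a later batch.
    Hypothetical sketch of the dedup mechanism described in the PR."""

    def __init__(self, remember_duration_secs=60):
        # How long (in seconds) to remember selected files; the PR
        # description mentions remembering files from the last 1 minute.
        self.remember_duration = remember_duration_secs
        self.selected = {}  # file path -> timestamp when it was selected

    def select_new_files(self, candidate_files, now=None):
        """Return only candidates not selected within the remember window,
        and record them as selected at time `now`."""
        now = time.time() if now is None else now
        # Forget entries older than the remember window.
        cutoff = now - self.remember_duration
        self.selected = {f: t for f, t in self.selected.items() if t >= cutoff}
        new_files = [f for f in candidate_files if f not in self.selected]
        for f in new_files:
            self.selected[f] = now
        return new_files
```

For example, if batch t selects file "a", a listing for batch t+2 that still returns "a" (within the remember window) will not select it again, which is the corner case the PR addresses.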
@pwendell Please take a look. I still have not run long-running integration
tests, so I cannot say for sure whether this has indeed solved the issue. You
could do a first pass on this in the meantime.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tdas/spark filestream-fix2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3419.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3419
----
commit 9dbd40abe4627fd43f664630512540d62bb785e1
Author: Tathagata Das <[email protected]>
Date: 2014-11-23T01:24:41Z
Refactored FileInputDStream to remember last few batches.
commit eaef4e1ac11929f9aef7f57b8f52d26e3c048901
Author: Tathagata Das <[email protected]>
Date: 2014-11-23T01:26:27Z
Fixed SPARK-4519
commit 203bbc72d6cc96c67fbf86f4d137b3e7fa8afd30
Author: Tathagata Das <[email protected]>
Date: 2014-11-23T01:28:23Z
Un-ignore tests.
----