GitHub user harishreedharan opened a pull request:
https://github.com/apache/spark/pull/4964
[SPARK-6222][STREAMING] Checkpoints should be done only after batches are
completed.
Currently, checkpoints are initiated as soon as a batch is submitted for
processing; we do not wait for the processing to complete before starting
a checkpoint. The checkpoint at a given time therefore does not reflect
what has actually been processed, so data loss is possible even for
DStreams that do not use the WAL. For example, with the direct Kafka
connector, the checkpoint can contain a range of offsets that have not
yet been processed, leading to data loss.
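The failure mode described above can be illustrated with a minimal Python sketch. This is a toy model, not Spark code: `run`, `checkpoint_on`, and `crash_index` are hypothetical names, and a checkpoint is modeled as nothing more than the end offset of a batch.

```python
def run(batches, checkpoint_on, crash_index):
    """Toy model: return (checkpointed_offset, processed_offset) at crash time.

    batches: ordered list of (start, end) offset ranges, one per batch.
    checkpoint_on: "submit" (checkpoint when the batch is submitted) or
                   "complete" (checkpoint only after the batch finishes).
    crash_index: index of the batch that is submitted but never finishes
                 because the driver crashes while it is in flight.
    """
    checkpointed = 0
    processed = 0
    for i, (_start, end) in enumerate(batches):
        if checkpoint_on == "submit":
            checkpointed = end      # checkpoint written before processing
        if i == crash_index:
            break                   # crash: this batch is never processed
        processed = end
        if checkpoint_on == "complete":
            checkpointed = end      # checkpoint written after processing
    return checkpointed, processed


batches = [(0, 100), (100, 200), (200, 300)]
early = run(batches, "submit", crash_index=1)
late = run(batches, "complete", crash_index=1)
```

With checkpoint-on-submit, the checkpoint is ahead of the processed offset when the crash occurs, so a recovery that resumes from the checkpoint would skip that range. With checkpoint-on-complete, the two values agree and the in-flight batch would be replayed instead.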
In addition, the WAL files that are deleted at the end of a checkpoint may
contain unprocessed data if more than a handful of batches are waiting to
be processed, which can also result in data loss.
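The WAL cleanup hazard can be sketched the same way. The snippet below is a hedged illustration only: `segments_safe_to_delete` and the mapping from batch time to segment name are assumptions made for the example, not Spark's actual WAL layout or API. The idea it shows is that cleanup should be bounded by the last completed batch rather than by the last submitted one.

```python
def segments_safe_to_delete(wal_segments, last_completed_batch):
    """Return the WAL segments whose batch has finished processing.

    wal_segments: dict mapping batch time -> segment name (hypothetical layout).
    last_completed_batch: time of the most recent fully processed batch,
                          or None if no batch has completed yet.
    """
    if last_completed_batch is None:
        return []
    return [
        segment
        for batch_time, segment in sorted(wal_segments.items())
        if batch_time <= last_completed_batch
    ]


wal = {1: "log-1", 2: "log-2", 3: "log-3", 4: "log-4"}
# Batches 3 and 4 are still queued, so their segments must be kept.
deletable = segments_safe_to_delete(wal, last_completed_batch=2)
```

Under this model, a cleanup keyed to the last submitted batch (4 here) would delete `log-3` and `log-4` even though their data has not been processed yet.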
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/harishreedharan/spark checkpointAfterBatch
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/4964.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #4964
----
commit 076b1344776f4e0d0242217aa0c0a823727994ab
Author: Hari Shreedharan
Date: 2015-03-10T20:22:00Z
[SPARK-6222][STREAMING] Checkpoints should be done only after batches are
completed.
----