GitHub user harishreedharan opened a pull request:

    https://github.com/apache/spark/pull/4964

    [SPARK-6222][STREAMING] Checkpoints should be done only after batches are completed.
    
    Currently a checkpoint is initiated as soon as a batch is submitted for
    processing; we do not wait for the processing to complete before starting
    the checkpoint.
    
    As a result, a checkpoint taken at a given time does not reflect what has
    actually been processed, so data loss is possible even for DStreams that do
    not use the WAL. For example, with the direct Kafka connector, the checkpoint
    may contain a range of offsets that have not yet been processed, leading to
    data loss on recovery.
    
    In addition, the WAL files deleted at the end of a checkpoint may still
    contain unprocessed data if more than a handful of batches are waiting to
    be processed, which can also result in data loss.
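The ordering change the PR describes can be illustrated with a minimal sketch (not Spark code; all names here are hypothetical): the driver records a checkpoint only when a batch completes, never when it is merely submitted, so a checkpoint recovered after a crash never claims offsets that were still unprocessed.

```python
# Hypothetical illustration of "checkpoint only after batch completion".
# A checkpoint taken on submission would record offsets the batch has not
# yet processed; deferring it to completion avoids that window of data loss.

class CheckpointingDriver:
    def __init__(self):
        self.checkpointed_offset = 0  # last offset safe to recover from
        self.submitted_offset = 0     # last offset handed to a running batch

    def submit_batch(self, end_offset):
        # Batch is queued for processing; do NOT checkpoint yet.
        self.submitted_offset = end_offset

    def on_batch_completed(self, end_offset):
        # Only now is it safe to persist this offset range.
        self.checkpointed_offset = end_offset

driver = CheckpointingDriver()
driver.submit_batch(100)                  # batch submitted, still running
assert driver.checkpointed_offset == 0    # a crash here loses no claimed data
driver.on_batch_completed(100)
assert driver.checkpointed_offset == 100  # offsets checkpointed after completion
```

A crash between `submit_batch` and `on_batch_completed` leaves the checkpoint at the last *completed* offset, so recovery replays the in-flight batch instead of skipping it.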

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/harishreedharan/spark checkpointAfterBatch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4964.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4964
    
----
commit 076b1344776f4e0d0242217aa0c0a823727994ab
Author: Hari Shreedharan <[email protected]>
Date:   2015-03-10T20:22:00Z

    [SPARK-6222][STREAMING] Checkpoints should be done only after batches are completed.
    
    Currently a checkpoint is initiated as soon as a batch is submitted for
    processing; we do not wait for the processing to complete before starting
    the checkpoint.
    
    As a result, a checkpoint taken at a given time does not reflect what has
    actually been processed, so data loss is possible even for DStreams that do
    not use the WAL. For example, with the direct Kafka connector, the checkpoint
    may contain a range of offsets that have not yet been processed, leading to
    data loss on recovery.
    
    In addition, the WAL files deleted at the end of a checkpoint may still
    contain unprocessed data if more than a handful of batches are waiting to
    be processed, which can also result in data loss.

----


