GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/7825

    [SPARK-9217][STREAMING] Make the kinesis receiver reliable by recording 
sequence numbers

    This PR is the second one in the larger issue of making the Kinesis 
integration reliable and provide WAL-free at-least once guarantee. It is based 
on the design doc - 
https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit
    
    In this PR, I have updated the Kinesis Receiver to do the following. 
    - Control the block generation, by creating its own BlockGenerator with own 
callback methods and using it to keep track of the ranges of sequence numbers 
that go into each block.
    - More specifically, as the KinesisRecordProcessor provides small batches 
of records, the records are atomically inserted into the block (that is, either 
the whole batch is in the block, or not). Accordingly the sequence number range 
of the batch is recorded. Since there may be many batches added to a block, the 
receiver reports an array of ranges as block metadata. 
    - The array of ranges in the block metadata is used to create 
KinesisBackedBlockRDDs. The ReceiverInputDStream has been slightly refactored 
to allow the creation of KinesisBackedBlockRDDs instead of the 
WALBackedBlockRDDs. 
    
    Things to be done
    - [ ] Add new test to verify that the sequence numbers are recovered.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark kinesis-receiver

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7825.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7825
    
----
commit 1cd7b6655d1f76295262ad8be1a9174f00667bf6
Author: Tathagata Das <[email protected]>
Date:   2015-07-31T09:28:14Z

    Updated kinesis receiver

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to