GitHub user tdas opened a pull request:
https://github.com/apache/spark/pull/7825
[SPARK-9217][STREAMING] Make the kinesis receiver reliable by recording
sequence numbers
This PR is the second one in the larger issue of making the Kinesis
integration reliable and provide WAL-free at-least once guarantee. It is based
on the design doc -
https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit
In this PR, I have updated the Kinesis Receiver to do the following.
- Control the block generation, by creating its own BlockGenerator with own
callback methods and using it to keep track of the ranges of sequence numbers
that go into each block.
- More specifically, as the KinesisRecordProcessor provides small batches
of records, the records are atomically inserted into the block (that is, either
the whole batch is in the block, or not). Accordingly the sequence number range
of the batch is recorded. Since there may be many batches added to a block, the
receiver reports an array of ranges as block metadata.
- The array of ranges in the block metadata is used to create
KinesisBackedBlockRDDs. The ReceiverInputDStream has been slightly refactored
to allow the creation of KinesisBackedBlockRDDs instead of the
WALBackedBlockRDDs.
Things to be done
- [ ] Add new test to verify that the sequence numbers are recovered.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tdas/spark kinesis-receiver
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7825.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7825
----
commit 1cd7b6655d1f76295262ad8be1a9174f00667bf6
Author: Tathagata Das <[email protected]>
Date: 2015-07-31T09:28:14Z
Updated kinesis receiver
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]