GitHub user brucezhao11 opened a pull request:
https://github.com/apache/spark/pull/21929
[Streaming-24970][Kinesis] Create WriteAheadLogBackedBlockRDD for Kinesis
Streaming if WAL is enabled.
## What changes were proposed in this pull request?
By default, KinesisInputDStream creates a KinesisBackedBlockRDD. When an
application recovers from a streaming checkpoint, this RDD accesses
Kinesis directly to re-read the data. Because every partition in the BlockRDD
reads from Kinesis, and AWS Kinesis supports only 5 read transactions per
second per shard, recovery easily hits
ProvisionedThroughputExceededException. When the throttling is heavy,
the recovery fails.
When we use Spark Streaming, we usually enable the WAL (write-ahead log).
In that case, data can be recovered from the WAL instead of being re-read
from Kinesis directly.
This PR creates a WriteAheadLogBackedBlockRDD for Kinesis Streaming
when the WAL is enabled.
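
The selection described above can be sketched roughly as follows. This is a minimal, self-contained model and not the actual patch: `RecoveryRdd`, `WalBackedRdd`, `KinesisBackedRdd`, `BlockInfo`, and `chooseRdd` are hypothetical stand-ins for illustration only. The sketch also assumes that the WAL-backed path is only safe when every block has a WAL record handle, and that `walEnabled` reflects the `spark.streaming.receiver.writeAheadLog.enable` setting.

```scala
// Stand-in types: these are NOT Spark's classes, only a model of the decision.
sealed trait RecoveryRdd
case object WalBackedRdd extends RecoveryRdd     // models WriteAheadLogBackedBlockRDD
case object KinesisBackedRdd extends RecoveryRdd // models KinesisBackedBlockRDD

// Hypothetical per-block metadata: whether a WAL record handle exists for the block.
final case class BlockInfo(blockId: String, hasWalRecordHandle: Boolean)

object RddSelection {
  // Prefer the WAL only when it is enabled and every block can be read back from it;
  // otherwise fall back to re-reading the sequence-number ranges from Kinesis.
  def chooseRdd(walEnabled: Boolean, blocks: Seq[BlockInfo]): RecoveryRdd =
    if (walEnabled && blocks.nonEmpty && blocks.forall(_.hasWalRecordHandle)) WalBackedRdd
    else KinesisBackedRdd
}
```

With this model, recovery touches Kinesis only when the WAL is disabled or incomplete, so the per-shard read limit is no longer the bottleneck in the common case.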
## How was this patch tested?
I will add a test case to KinesisStreamSuite.scala.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/brucezhao11/spark dev
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21929.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21929
----
commit cbff70a803a59fce0d6121fabd0f5958d7b0436d
Author: bruce_zhao <bruce_zhao@...>
Date: 2018-07-30T14:08:08Z
[Streaming][Kinesis] Create WriteAheadLogBackedBlockRDD for Kinesis
Streaming if WAL is enabled. With this change, it can recover data from WAL,
instead of re-reading from Kinesis directly when an application tries to
recover from streaming checkpoint.
----