GitHub user Gauravshah opened a pull request:
https://github.com/apache/spark/pull/16842
SPARK-19304 fix kinesis slow checkpoint recovery
## What changes were proposed in this pull request?
Added a limit to the getRecords API call in KinesisBackedBlockRDD. This
reduces the amount of data returned per Kinesis API call, making recovery
considerably faster; a rough sketch of the idea follows.
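A minimal sketch, assuming the AWS SDK v1 Kinesis client that the connector uses; `shardIterator` and `recordLimit` are placeholders for values the RDD's record iterator already has in scope, not names from the actual patch:

```scala
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.{GetRecordsRequest, GetRecordsResult}

// Cap the number of records fetched per getRecords call during recovery.
// Without a limit, a single call may return far more records than the
// checkpointed sequence-number range actually needs.
def getRecordsWithLimit(client: AmazonKinesisClient,
                        shardIterator: String,
                        recordLimit: Int): GetRecordsResult = {
  val request = new GetRecordsRequest()
  request.setShardIterator(shardIterator)
  request.setLimit(Integer.valueOf(recordLimit))
  client.getRecords(request)
}
```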
Since we already store the `fromSeqNum` and `toSeqNum` in the checkpoint
metadata, we can also store the number of records, which can later be used as
the limit for the API call.
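A rough sketch of how the checkpoint metadata could carry that count, assuming it is modeled along the lines of the connector's sequence-number range; the field and helper names here are illustrative, not the exact ones in `KinesisBackedBlockRDD`:

```scala
// Illustrative checkpoint metadata: alongside the from/to sequence numbers,
// keep the number of records the range spans.
case class SequenceNumberRange(
    streamName: String,
    shardId: String,
    fromSeqNumber: String,
    toSeqNumber: String,
    recordCount: Int)

// On recovery, the stored count bounds the getRecords request instead of
// the Kinesis default (up to 10,000 records per call).
def recordLimitFor(range: SequenceNumberRange, maxPerCall: Int = 10000): Int =
  math.min(range.recordCount, maxPerCall)
```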
## How was this patch tested?
The patch was tested manually.
Apologies for any silly mistakes; this is my first pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Gauravshah/spark
kinesis_checkpoint_recovery_fix_2_1_0
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16842.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16842
----
commit b5e544a8ec326149b7d03773dd7abf8703ee44a2
Author: Gaurav <[email protected]>
Date: 2017-02-07T19:21:28Z
added limit to kinesis checkpoint backed rdd to reduce the number of records
loaded per aws getRecords call
----