[ https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15841009#comment-15841009 ]
Gaurav Shah commented on SPARK-19304:
-------------------------------------

There are two issues in `KinesisSequenceRangeIterator.getNext`:

* When `internalIterator` is exhausted, it calls `getRecords`, which in turn makes two API calls: one to get a shard iterator and another to get the records. The first call can be avoided by storing the next iterator returned by `getRecordsAndNextKinesisIterator` (see the first sketch at the end of this message).
* The second issue is more complicated: each checkpoint block issues its own API call. Consider one shard and 10 checkpoint blocks. For each block (or range) we construct a `new KinesisSequenceRangeIterator`, which makes a separate `getRecords` call. That call may return far more records than the range actually needs, and the surplus is discarded. For the next range we construct another `new KinesisSequenceRangeIterator`, even though its records may already have been returned by the first call. Example: with ranges [1-10], [11-20], [21-30], [31-40], the first `new KinesisSequenceRangeIterator` fetches records [1-30] but ignores everything past 10; the next one fetches [11-40] again but uses only [11-20] (see the second sketch at the end of this message).

> Kinesis checkpoint recovery is 10x slower
> ------------------------------------------
>
>                 Key: SPARK-19304
>                 URL: https://issues.apache.org/jira/browse/SPARK-19304
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>         Environment: using S3 for checkpoints; 1 executor with 19g memory & 3 cores
>            Reporter: Gaurav Shah
>              Labels: kinesis
>
> The application initially runs fine, with 1-hour batches and an average processing time under 30 minutes. Now suppose the application crashes and we restart it from the checkpoint: processing takes forever and does not move forward. We tested the same setup at a batch interval of 1 minute: normally each batch finishes in about 1.2 minutes, but when recovering from a checkpoint each batch takes about 15 minutes. Once recovery completes, batches again process at normal speed.
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details:
> http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow
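For the first issue, here is a minimal sketch of what caching the iterator could look like. This is illustrative code, not Spark's implementation: it calls the AWS SDK for Java v1 (`com.amazonaws:aws-java-sdk-kinesis`) directly, and `CachedShardReader` and its parameters are invented for the example.

```scala
import com.amazonaws.services.kinesis.{AmazonKinesis, AmazonKinesisClientBuilder}
import com.amazonaws.services.kinesis.model._
import scala.collection.JavaConverters._

class CachedShardReader(client: AmazonKinesis, streamName: String, shardId: String) {

  // Iterator cached from the previous GetRecords response, if any.
  private var nextIterator: Option[String] = None

  // The extra API call we want to avoid making on every fetch.
  private def freshIterator(startSeqNum: String): String =
    client.getShardIterator(
      new GetShardIteratorRequest()
        .withStreamName(streamName)
        .withShardId(shardId)
        .withShardIteratorType(ShardIteratorType.AT_SEQUENCE_NUMBER)
        .withStartingSequenceNumber(startSeqNum)
    ).getShardIterator

  /** One GetRecords call; GetShardIterator is hit only on the first fetch. */
  def fetch(startSeqNum: String, limit: Int): Seq[Record] = {
    val iterator = nextIterator.getOrElse(freshIterator(startSeqNum))
    val result = client.getRecords(
      new GetRecordsRequest().withShardIterator(iterator).withLimit(limit))
    nextIterator = Option(result.getNextShardIterator) // cache for next time
    result.getRecords.asScala
  }
}
```

A caller would construct one reader per shard, e.g. `new CachedShardReader(AmazonKinesisClientBuilder.defaultClient(), streamName, shardId)`; every `fetch` after the first then costs a single GetRecords call instead of two.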
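For the second issue, a self-contained simulation (no AWS calls; all names invented for illustration) of serving consecutive ranges from one shared buffer instead of one `getRecords` per range. `fetchBatch` stands in for a GetRecords call that always returns 30 records from the requested position.

```scala
object SharedRangeReaderDemo {
  val batchSize = 30 // records returned per simulated API call
  var apiCalls = 0   // number of simulated GetRecords calls

  // Stand-in for GetRecords: returns batchSize records starting at `from`.
  def fetchBatch(from: Long): Seq[Long] = {
    apiCalls += 1
    from until (from + batchSize)
  }

  // Current behaviour: one fetch per range, discarding records past the range end.
  def readNaive(ranges: Seq[(Long, Long)]): Seq[Long] =
    ranges.flatMap { case (lo, hi) => fetchBatch(lo).filter(_ <= hi) }

  // Possible fix: keep leftover records and fetch only when the buffer runs short.
  def readBuffered(ranges: Seq[(Long, Long)]): Seq[Long] = {
    var buffer = Vector.empty[Long]
    ranges.flatMap { case (lo, hi) =>
      buffer = buffer.dropWhile(_ < lo)            // drop records before the range
      while (buffer.isEmpty || buffer.last < hi) { // fetch only if needed
        val next = if (buffer.isEmpty) lo else buffer.last + 1
        buffer = buffer ++ fetchBatch(next)
      }
      val (inRange, rest) = buffer.span(_ <= hi)
      buffer = rest                                // leftovers serve the next range
      inRange
    }
  }

  def main(args: Array[String]): Unit = {
    val ranges = Seq((1L, 10L), (11L, 20L), (21L, 30L), (31L, 40L))
    apiCalls = 0; readNaive(ranges);    println(s"naive reader:    $apiCalls API calls")
    apiCalls = 0; readBuffered(ranges); println(s"buffered reader: $apiCalls API calls")
  }
}
```

With the example ranges above, the naive reader makes 4 simulated GetRecords calls while the buffered one makes 2; the gap widens as the number of checkpoint ranges per shard grows.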