[ https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15841009#comment-15841009 ]
Gaurav Shah commented on SPARK-19304:
-------------------------------------

There are two issues in `KinesisSequenceRangeIterator.getNext`:

* When `internalIterator` is exhausted, it calls `getRecords`, which in turn makes two API calls: one to get a shard iterator and another to get the records. The first call can be avoided by storing the next iterator returned by `getRecordsAndNextKinesisIterator` (see the first sketch at the end of this message).
* The second issue is more complicated: each checkpoint block issues its own API call. Consider one shard and 10 checkpoint blocks. For each block (or range) we construct a `new KinesisSequenceRangeIterator`, which makes a separate `getRecords` call. That call may return far more records than the range actually needs, and the surplus is discarded. For the next range we construct another `new KinesisSequenceRangeIterator`, even though its records may already have been returned by the first call. Example: with ranges [1-10], [11-20], [21-30], [31-40], the first `new KinesisSequenceRangeIterator` fetches records [1-30] but ignores everything past 10; the next one fetches [11-40] again but uses only [11-20] (see the second sketch at the end of this message).

> Kinesis checkpoint recovery is 10x slower
> ------------------------------------------
>
>                 Key: SPARK-19304
>                 URL: https://issues.apache.org/jira/browse/SPARK-19304
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>         Environment: using S3 for checkpoints; 1 executor with 19g memory & 3 cores
>            Reporter: Gaurav Shah
>              Labels: kinesis
>
> The application initially runs fine, with 1-hour batches and an average processing time under 30 minutes. Now suppose the application crashes and we restart it from the checkpoint: processing takes forever and does not move forward. We tested the same setup at a batch interval of 1 minute: normally each batch finishes in about 1.2 minutes, but when recovering from a checkpoint each batch takes about 15 minutes. Once recovery completes, batches again process at normal speed.
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details:
> http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow
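For the first issue, here is a minimal sketch of what caching the iterator could look like. This is illustrative code, not Spark's implementation: it calls the AWS SDK for Java v1 (`com.amazonaws:aws-java-sdk-kinesis`) directly, and `CachedShardReader` and its parameters are invented for the example.

```scala
import com.amazonaws.services.kinesis.{AmazonKinesis, AmazonKinesisClientBuilder}
import com.amazonaws.services.kinesis.model._
import scala.collection.JavaConverters._

class CachedShardReader(client: AmazonKinesis, streamName: String, shardId: String) {

  // Iterator cached from the previous GetRecords response, if any.
  private var nextIterator: Option[String] = None

  // The extra API call we want to avoid making on every fetch.
  private def freshIterator(startSeqNum: String): String =
    client.getShardIterator(
      new GetShardIteratorRequest()
        .withStreamName(streamName)
        .withShardId(shardId)
        .withShardIteratorType(ShardIteratorType.AT_SEQUENCE_NUMBER)
        .withStartingSequenceNumber(startSeqNum)
    ).getShardIterator

  /** One GetRecords call; GetShardIterator is hit only on the first fetch. */
  def fetch(startSeqNum: String, limit: Int): Seq[Record] = {
    val iterator = nextIterator.getOrElse(freshIterator(startSeqNum))
    val result = client.getRecords(
      new GetRecordsRequest().withShardIterator(iterator).withLimit(limit))
    nextIterator = Option(result.getNextShardIterator) // cache for next time
    result.getRecords.asScala
  }
}
```

A caller would construct one reader per shard, e.g. `new CachedShardReader(AmazonKinesisClientBuilder.defaultClient(), streamName, shardId)`; every `fetch` after the first then costs a single GetRecords call instead of two.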
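For the second issue, a self-contained simulation (no AWS calls; all names invented for illustration) of serving consecutive ranges from one shared buffer instead of one `getRecords` per range. `fetchBatch` stands in for a GetRecords call that always returns 30 records from the requested position.

```scala
object SharedRangeReaderDemo {
  val batchSize = 30 // records returned per simulated API call
  var apiCalls = 0   // number of simulated GetRecords calls

  // Stand-in for GetRecords: returns batchSize records starting at `from`.
  def fetchBatch(from: Long): Seq[Long] = {
    apiCalls += 1
    from until (from + batchSize)
  }

  // Current behaviour: one fetch per range, discarding records past the range end.
  def readNaive(ranges: Seq[(Long, Long)]): Seq[Long] =
    ranges.flatMap { case (lo, hi) => fetchBatch(lo).filter(_ <= hi) }

  // Possible fix: keep leftover records and fetch only when the buffer runs short.
  def readBuffered(ranges: Seq[(Long, Long)]): Seq[Long] = {
    var buffer = Vector.empty[Long]
    ranges.flatMap { case (lo, hi) =>
      buffer = buffer.dropWhile(_ < lo)            // drop records before the range
      while (buffer.isEmpty || buffer.last < hi) { // fetch only if needed
        val next = if (buffer.isEmpty) lo else buffer.last + 1
        buffer = buffer ++ fetchBatch(next)
      }
      val (inRange, rest) = buffer.span(_ <= hi)
      buffer = rest                                // leftovers serve the next range
      inRange
    }
  }

  def main(args: Array[String]): Unit = {
    val ranges = Seq((1L, 10L), (11L, 20L), (21L, 30L), (31L, 40L))
    apiCalls = 0; readNaive(ranges);    println(s"naive reader:    $apiCalls API calls")
    apiCalls = 0; readBuffered(ranges); println(s"buffered reader: $apiCalls API calls")
  }
}
```

With the example ranges above, the naive reader makes 4 simulated GetRecords calls while the buffered one makes 2; the gap widens as the number of checkpoint ranges per shard grows.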