Hi guys (apologies for the huge font, reposting),
I've encountered some problems with a crashed Spark Streaming job when restoring from checkpoint. I'm running Spark 1.5.1 on YARN (Hadoop 2.6) in cluster mode, reading from Kafka with the direct consumer and a few updateStateByKey stateful transformations. After investigating, I think the following happened:

* The active ResourceManager crashed (the AWS machine died).
* 10 minutes later (default YARN settings :( ) the standby took over and redeployed the job, sending a SIGTERM to the running driver.
* Recovery from checkpoint failed because of a missing RDD in the checkpoint folder.

One complication (UNCONFIRMED because of missing logs): I believe the new driver was started ~5 minutes before the old one stopped.

With your help, I'm trying to zero in on a root cause, or a combination of:

* Bad YARN/Spark configuration (10 minutes to react to a missing node; already fixed through more aggressive liveness settings).
* A YARN fact of life: why is a running job redeployed when the standby RM takes over?
* A bug/race condition in Spark checkpoint cleanup/recovery: why is the RDD cleaned up by the old app, and why does recovery then fail when it looks for it?
* Bugs in the YARN-Spark integration (missing heartbeats? why is the new app started 5 minutes before the old one dies?).
* Application code: should we add graceful shutdown? Should I add a ZooKeeper lock that prevents two instances of the driver from starting at the same time? (Rough sketch of what I have in mind in the PS below.)

Sorry if the questions are a little all over the place; getting to the root cause of this was a pain, and I can't even log an issue in Jira without your help.

Attaching some logs that showcase the checkpoint recovery failure (I've grepped for "checkpoint" to highlight the core issue):

* Driver logs prior to shutdown: http://pastebin.com/eKqw27nT
* Driver logs, failed recovery: http://pastebin.com/pqACKK7W

Other info:

* spark.streaming.unpersist = true
* spark.cleaner.ttl = 259200 (3 days)

Last question: in the checkpoint recovery process I notice that it goes back ~6 minutes on the persisted RDDs and ~10 minutes to replay from Kafka. I'm running with 20-second batches and a 100-second checkpoint interval (small issue: one of the RDDs was using the default interval of 20 seconds). Shouldn't the lineage be a lot smaller? Based on the documentation, I would have expected recovery to go back at most 100 seconds, since I'm not doing any windowed operations...

Thanks in advance!
-adrian
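PS: in case it helps frame the "application code" bullet and the checkpoint-interval question, here is a minimal sketch of how I understand the two fixes would be wired up: an explicit 100-second checkpoint interval on the state stream, plus spark.streaming.stopGracefullyOnShutdown for graceful shutdown. The app name, topic, broker and checkpoint path below are made up for illustration, not our actual job.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object CheckpointedJob {
      // Illustrative checkpoint location, not our real path.
      val checkpointDir = "hdfs:///user/adrian/checkpoints/sample-job"

      def createContext(): StreamingContext = {
        val conf = new SparkConf()
          .setAppName("sample-streaming-job")
          // Ask Spark to stop the StreamingContext gracefully on SIGTERM,
          // letting in-flight batches finish before the driver exits.
          .set("spark.streaming.stopGracefullyOnShutdown", "true")

        val ssc = new StreamingContext(conf, Seconds(20)) // 20 s batches
        ssc.checkpoint(checkpointDir)

        // Kafka direct consumer; broker and topic are placeholders.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("events"))

        val counts = stream
          .map { case (_, value) => (value, 1L) }
          .updateStateByKey[Long]((values: Seq[Long], state: Option[Long]) =>
            Some(state.getOrElse(0L) + values.sum))

        // Without this, the state stream checkpoints at the default interval
        // (which works out to the 20 s batch interval here); pin it to 100 s
        // so all stateful streams share the same checkpoint cadence.
        counts.checkpoint(Seconds(100))

        counts.print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Recover from the checkpoint if one exists, otherwise build a fresh context.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

If that's the wrong way to pin the checkpoint interval on the state stream, corrections welcome.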