Hi guys (apologies for the huge font, reposting),
I've encountered some problems with a crashed Spark Streaming job when restoring from checkpoint. I'm running Spark 1.5.1 on YARN (Hadoop 2.6) in cluster mode, reading from Kafka with the direct consumer and a few updateStateByKey stateful transformations. After investigating, I think the following happened:

* The active ResourceManager crashed (the AWS machine died).
* 10 minutes later (default YARN settings :( ) the standby took over and redeployed the job, sending a SIGTERM to the running driver.
* Recovery from checkpoint failed because of a missing RDD in the checkpoint folder.

One complication (UNCONFIRMED because of missing logs): I believe the new driver was started ~5 minutes before the old one stopped.

With your help, I'm trying to zero in on a root cause, or a combination of:

* Bad YARN/Spark configuration (10 minutes to react to a missing node; already fixed through more aggressive liveness settings).
* A YARN fact of life: why is a running job redeployed when the standby RM takes over?
* A bug/race condition in Spark checkpoint cleanup/recovery: why is the RDD cleaned up by the old app, and why does recovery then fail when it looks for it?
* Bugs in the YARN-Spark integration (missing heartbeats? why is the new app started 5 minutes before the old one dies?).
* Application code: should we add graceful shutdown? Should I add a ZooKeeper lock that prevents two instances of the driver from starting at the same time? (Rough sketch of what I have in mind in the PS below.)

Sorry if the questions are a little all over the place; getting to the root cause of this was a pain, and I can't even log an issue in Jira without your help.

Attaching some logs that showcase the checkpoint recovery failure (I've grepped for "checkpoint" to highlight the core issue):

* Driver logs prior to shutdown: http://pastebin.com/eKqw27nT
* Driver logs, failed recovery: http://pastebin.com/pqACKK7W

Other info:

* spark.streaming.unpersist = true
* spark.cleaner.ttl = 259200 (3 days)

Last question: in the checkpoint recovery process I notice that it goes back ~6 minutes on the persisted RDDs and ~10 minutes to replay from Kafka. I'm running with 20-second batches and a 100-second checkpoint interval (small issue: one of the RDDs was using the default interval of 20 seconds). Shouldn't the lineage be a lot smaller? Based on the documentation, I would have expected recovery to go back at most 100 seconds, since I'm not doing any windowed operations...

Thanks in advance!
-adrian
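PS: in case it helps frame the "application code" bullet and the checkpoint-interval question, here is a minimal sketch of how I understand the two fixes would be wired up: an explicit 100-second checkpoint interval on the state stream, plus spark.streaming.stopGracefullyOnShutdown for graceful shutdown. The app name, topic, broker and checkpoint path below are made up for illustration, not our actual job.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object CheckpointedJob {
      // Illustrative checkpoint location, not our real path.
      val checkpointDir = "hdfs:///user/adrian/checkpoints/sample-job"

      def createContext(): StreamingContext = {
        val conf = new SparkConf()
          .setAppName("sample-streaming-job")
          // Ask Spark to stop the StreamingContext gracefully on SIGTERM,
          // letting in-flight batches finish before the driver exits.
          .set("spark.streaming.stopGracefullyOnShutdown", "true")

        val ssc = new StreamingContext(conf, Seconds(20)) // 20 s batches
        ssc.checkpoint(checkpointDir)

        // Kafka direct consumer; broker and topic are placeholders.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("events"))

        val counts = stream
          .map { case (_, value) => (value, 1L) }
          .updateStateByKey[Long]((values: Seq[Long], state: Option[Long]) =>
            Some(state.getOrElse(0L) + values.sum))

        // Without this, the state stream checkpoints at the default interval
        // (which works out to the 20 s batch interval here); pin it to 100 s
        // so all stateful streams share the same checkpoint cadence.
        counts.checkpoint(Seconds(100))

        counts.print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // Recover from the checkpoint if one exists, otherwise build a fresh context.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }

If that's the wrong way to pin the checkpoint interval on the state stream, corrections welcome.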