This is a follow-up to this unanswered October 2015 thread: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-streaming-failed-recovery-from-checkpoint-td14832.html
The issue is that the Spark driver checkpoints an RDD, deletes it, the job restarts, and *the new driver tries to load the deleted checkpoint RDD*. The application runs on YARN, which attempts to restart it a number of times (100 in our case), all of which fail because the deleted RDD is missing.

Here is a Splunk log excerpt that shows the inconsistency in checkpoint behaviour:

*2016-10-09 02:48:43,533* [streaming-job-executor-0] INFO org.apache.spark.rdd.ReliableRDDCheckpointData - Done checkpointing RDD 73847 to hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*, new parent is RDD 73872
host = ip-10-1-1-13.ec2.internal

*2016-10-09 02:53:14,696* [JobGenerator] INFO org.apache.spark.streaming.dstream.DStreamCheckpointData - Deleted checkpoint file 'hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*' for time 1475981310000 ms
host = ip-10-1-1-13.ec2.internal

*The job restarts here; notice the driver host change from ip-10-1-1-13.ec2.internal to ip-10-1-1-25.ec2.internal.*

*2016-10-09 02:53:30,175* [Driver] INFO org.apache.spark.streaming.dstream.DStreamCheckpointData - Restoring checkpointed RDD for time 1475981310000 ms from file 'hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*'
host = ip-10-1-1-25.ec2.internal

*2016-10-09 02:53:30,491* [Driver] ERROR org.apache.spark.deploy.yarn.ApplicationMaster - User class threw exception: java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist: hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*
java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist: hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*
host = ip-10-1-1-25.ec2.internal

Spark Streaming is configured with a microbatch interval of 30 seconds, a checkpoint interval of 120 seconds, and a cleaner.ttl of 28800 (8 hours), but as far as I can tell, that TTL only affects the metadata cleanup interval. RDDs appear to be deleted 4-5 minutes after being checkpointed. We are running on Spark 1.5.1. A minimal sketch of this setup is included at the end of this message.

Questions:

- How is checkpoint deletion metadata saved so that, after a driver restart, the new driver does not read invalid metadata? Is there an interval between the point when the checkpoint data is deleted and the point when that deletion is recorded?
- Why does Spark try to load data checkpointed 4-5 minutes in the past, given that the checkpoint interval is 120 seconds and 5-minute-old data is therefore stale? There should be a newer checkpoint.
- Would the memory management in Spark 2.0 handle this differently? Since we have not been able to reproduce the issue outside production environments, it would be useful to know in advance whether there have been changes in this area.

This issue is constantly causing serious data loss in production, so I'd appreciate any assistance with it.
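For reference, here is a minimal sketch of how our streaming context and checkpointing are set up. The checkpoint path, the socket source, and the word-count computation are placeholders for illustration only, not our production code; the intervals and TTL match the values described above.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointSetupSketch {
  // Placeholder checkpoint directory; production uses
  // hdfs://proc-job/checkpoint/<application-uuid>
  val checkpointDir = "hdfs://proc-job/checkpoint/example-app"

  // Builds a fresh context; only called when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("checkpoint-recovery-sketch")
      .set("spark.cleaner.ttl", "28800") // 8 hours, as described above

    val ssc = new StreamingContext(conf, Seconds(30)) // 30-second microbatches
    ssc.checkpoint(checkpointDir)

    // Placeholder source and trivial computation for illustration only.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _)
    counts.checkpoint(Seconds(120)) // 120-second checkpoint interval
    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // When YARN restarts the driver, getOrCreate reads the checkpoint
    // metadata and restores the RDDs referenced there -- the step that
    // fails with "Checkpoint directory does not exist" in the log above.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}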