Hi Mark,

For file-based input streams such as textFileStream, only the RDDs can be recovered from a checkpoint, not missed files; if a file is missing at recovery time, an exception will be raised. If you use HDFS, HDFS will guarantee no data loss since it keeps 3 replicas of each block. Otherwise, user logic has to guarantee that no files are deleted before recovery.
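A minimal sketch of what driver recovery from a checkpoint looks like, using StreamingContext.getOrCreate. The checkpoint and input paths, application name, and batch interval here are illustrative, not from the original thread:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoveryExample {
  // Illustrative HDFS locations; replace with your own.
  val checkpointDir = "hdfs:///user/spark/checkpoints/textfile-job"
  val inputDir      = "hdfs:///user/spark/incoming"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("TextFileStreamJob")
    val ssc  = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)

    // Only DStream metadata and RDD lineage are checkpointed; the
    // input files themselves must still exist under inputDir when
    // the driver recovers, or an exception is raised.
    val lines = ssc.textFileStream(inputDir)
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, rebuild the context (and its pending batches) from
    // the checkpoint if one exists; otherwise create it fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

This sketch assumes a Spark Streaming deployment and cannot run standalone; the point is that getOrCreate, not manual construction, is what makes the driver resumable after a crash.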
For receiver-based input streams, such as the Kafka or socket input streams, a WAL (write-ahead log) mechanism can be enabled to store the received data as well as the metadata, so the data can be recovered after a failure.

Thanks
Jerry

-----Original Message-----
From: mkhaitman [mailto:mark.khait...@chango.com]
Sent: Monday, February 23, 2015 10:54 AM
To: dev@spark.apache.org
Subject: StreamingContext textFileStream question

Hello,

I was interested in creating a StreamingContext textFileStream-based job which runs for long durations and can also recover from prolonged driver failure. It seems that StreamingContext checkpointing is mainly used for the case where the driver dies during the processing of an RDD, in order to recover that one RDD. My question is specifically whether there is also a way to recover the files that were missed between the time the driver died and the time it was started back up (whether manually or automatically).

Any assistance/suggestions with this one would be greatly appreciated!

Thanks,
Mark.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
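The write-ahead log Jerry mentions is switched on through a Spark configuration property rather than an API call. A minimal sketch, assuming a socket receiver; the host, port, paths, and batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReceiverWithWAL")
  // Persist received blocks to the write-ahead log before they are
  // acknowledged, so they survive driver and executor failures.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// The WAL is written under the checkpoint directory, so a fault-tolerant
// checkpoint location (e.g. on HDFS) must be configured as well.
ssc.checkpoint("hdfs:///user/spark/checkpoints/receiver-job")

// With the WAL enabled, serialized storage without in-memory replication
// is sufficient, since durability already comes from the log.
val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER)
lines.print()

ssc.start()
ssc.awaitTermination()
```

This configuration fragment requires a running Spark Streaming deployment; the key pieces are the writeAheadLog.enable flag together with an HDFS-backed checkpoint directory.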