Diana, that is a good question. When you persist an RDD, the system still remembers the whole lineage of parent RDDs that created that RDD. If one of the executor fails, and the persist data is lost (both local disk and memory data will get lost), then the lineage is used to recreate the RDD. The longer the lineage, the more recomputation the system has to do in case of failure, and hence higher recovery time. So its not a good idea to have a very long lineage, as it leads to all sorts of problems, like the one Xiangrui pointed to.
Checkpointing an RDD actually saves the RDD data to HDFS and removes pointers to the parent RDDs (as the data can be regenerated just by reading from the HDFS file). So that RDDs data does not need to be "recomputed" when worker fails, just re-read. In fact, the data is also retained across driver restarts as it is in HDFS. RDD.checkpoint was introduced with streaming because streaming is obvious use case where the lineage will grow infinitely long (for stateful computations where each result depends on all the previously received data). However, this checkpointing is useful for any long running RDD computation, and I know that people have used RDD.checkpoint() independent of streaming. TD On Mon, Apr 21, 2014 at 1:10 PM, Xiangrui Meng <men...@gmail.com> wrote: > Persist doesn't cut lineage. You might run into StackOverflow problem > with a long lineage. See > https://spark-project.atlassian.net/browse/SPARK-1006 for example. > > On Mon, Apr 21, 2014 at 12:11 PM, Diana Carroll <dcarr...@cloudera.com> > wrote: > > When might that be necessary or useful? Presumably I can persist and > > replicate my RDD to avoid re-computation, if that's my goal. What > advantage > > does checkpointing provide over disk persistence with replication? > > > > > > On Mon, Apr 21, 2014 at 2:42 PM, Xiangrui Meng <men...@gmail.com> wrote: > >> > >> Checkpoint clears dependencies. You might need checkpoint to cut a > >> long lineage in iterative algorithms. -Xiangrui > >> > >> On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll <dcarr...@cloudera.com> > >> wrote: > >> > I'm trying to understand when I would want to checkpoint an RDD rather > >> > than > >> > just persist to disk. > >> > > >> > Every reference I can find to checkpoint related to Spark Streaming. > >> > But > >> > the method is defined in the core Spark library, not Streaming. > >> > > >> > Does it exist solely for streaming, or are there circumstances > unrelated > >> > to > >> > streaming in which I might want to checkpoint...and if so, like what? > >> > > >> > Thanks, > >> > Diana > > > > >