Diana, that is a good question.

When you persist an RDD, the system still remembers the whole lineage of
parent RDDs that created that RDD. If one of the executor fails, and the
persist data is lost (both local disk and memory data will get lost), then
the lineage is used to recreate the RDD. The longer the lineage, the more
recomputation the system has to do in case of failure, and hence higher
recovery time. So its not a good idea to have a very long lineage, as it
leads to all sorts of problems, like the one Xiangrui pointed to.

Checkpointing an RDD actually saves the RDD data to HDFS and removes
pointers to the parent RDDs (as the data can be regenerated just by reading
from the HDFS file). So that RDDs data does not need to be "recomputed"
when worker fails, just re-read. In fact, the data is also retained across
driver restarts as it is in HDFS.

RDD.checkpoint was introduced with streaming because streaming is obvious
use case where the lineage will grow infinitely long (for stateful
computations where each result depends on all the previously received
data). However, this checkpointing is useful for any long running RDD
computation, and I know that people have used RDD.checkpoint() independent
of streaming.

TD


On Mon, Apr 21, 2014 at 1:10 PM, Xiangrui Meng <men...@gmail.com> wrote:

> Persist doesn't cut lineage. You might run into StackOverflow problem
> with a long lineage. See
> https://spark-project.atlassian.net/browse/SPARK-1006 for example.
>
> On Mon, Apr 21, 2014 at 12:11 PM, Diana Carroll <dcarr...@cloudera.com>
> wrote:
> > When might that be necessary or useful?  Presumably I can persist and
> > replicate my RDD to avoid re-computation, if that's my goal.  What
> advantage
> > does checkpointing provide over disk persistence with replication?
> >
> >
> > On Mon, Apr 21, 2014 at 2:42 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> Checkpoint clears dependencies. You might need checkpoint to cut a
> >> long lineage in iterative algorithms. -Xiangrui
> >>
> >> On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll <dcarr...@cloudera.com>
> >> wrote:
> >> > I'm trying to understand when I would want to checkpoint an RDD rather
> >> > than
> >> > just persist to disk.
> >> >
> >> > Every reference I can find to checkpoint related to Spark Streaming.
> >> > But
> >> > the method is defined in the core Spark library, not Streaming.
> >> >
> >> > Does it exist solely for streaming, or are there circumstances
> unrelated
> >> > to
> >> > streaming in which I might want to checkpoint...and if so, like what?
> >> >
> >> > Thanks,
> >> > Diana
> >
> >
>

Reply via email to