val x = List(1,2,3,4)
val y = sc.parallelize(x, 2).map(c => c * 2)
y.checkpoint
y.count

Is it possible to read the checkpointed RDD in another application?
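Note that checkpoint files are only written once a checkpoint directory has been set, and they land on reliable storage that another process could in principle read. A minimal spark-shell sketch (the /tmp path is just an example; the on-disk layout is Spark-internal, not a public API):

```scala
// Must be set before the action runs, or checkpoint() is a no-op warning.
sc.setCheckpointDir("/tmp/spark-checkpoints")  // example path; use HDFS on a cluster

val x = List(1, 2, 3, 4)
val y = sc.parallelize(x, 2).map(c => c * 2)
y.checkpoint()  // marks the RDD for checkpointing; nothing is written yet
y.count()       // first action materializes the RDD and writes the checkpoint

// Partition files now live under /tmp/spark-checkpoints/<UUID>/rdd-<id>/.
// Another application could read that directory, but the format is an
// internal serialization detail rather than a supported interchange format.
```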
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/checkpointing-without-streaming-tp4541p28691.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I'm trying to understand when I would want to checkpoint an RDD rather than
just persist to disk.
Every reference I can find to checkpoint relates to Spark Streaming. But
the method is defined in the core Spark library, not in Streaming.
Does it exist solely for streaming, or are there non-streaming uses as well?
Checkpoint clears dependencies. You might need checkpoint to cut a
long lineage in iterative algorithms. -Xiangrui
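Xiangrui's point about cutting long lineage in iterative algorithms can be sketched as follows (a hypothetical loop, assuming a spark-shell `sc` and an example checkpoint path):

```scala
sc.setCheckpointDir("/tmp/spark-checkpoints")  // example path

var rdd = sc.parallelize(1 to 1000, 4)
for (i <- 1 to 100) {
  rdd = rdd.map(_ + 1)     // each iteration appends another stage to the lineage
  if (i % 10 == 0) {
    rdd.checkpoint()       // cut the lineage every 10 iterations
    rdd.count()            // force an action so the checkpoint is materialized
  }
}
```

Without the periodic checkpoint, recovering a lost partition after many iterations would mean replaying the entire chain of maps from the original data; after a checkpoint, recovery starts from the saved files instead.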
On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll dcarr...@cloudera.com wrote:
I'm trying to understand when I would want to checkpoint an RDD rather than
just persist to disk.
When might that be necessary or useful? Presumably I can persist and
replicate my RDD to avoid re-computation, if that's my goal. What
advantage does checkpointing provide over disk persistence with
replication?
On Mon, Apr 21, 2014 at 2:42 PM, Xiangrui Meng men...@gmail.com wrote:
Diana, that is a good question.
When you persist an RDD, the system still remembers the whole lineage of
parent RDDs that created it. If one of the executors fails and the
persisted data is lost (both the local disk and in-memory copies will be
lost), then the lineage is used to recreate the RDD.
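The lineage being described can be inspected directly with `toDebugString`, which shows how checkpointing differs from persistence: persist keeps the parent chain, checkpoint drops it. A sketch (RDD class names in the output vary by Spark version):

```scala
sc.setCheckpointDir("/tmp/spark-checkpoints")  // example path

val y = sc.parallelize(1 to 4, 2).map(_ * 2)
println(y.toDebugString)  // lineage includes the parent collection RDD

y.checkpoint()
y.count()                 // action triggers the checkpoint write

println(y.toDebugString)  // lineage is now rooted at the checkpoint data;
                          // the parent RDDs have been dropped
```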