Re: Does RDD checkpointing store the entire state in HDFS?

Tathagata Das Tue, 14 Jul 2015 16:25:59 -0700

1. When you set ssc.checkpoint(checkpointDir), Spark Streaming
periodically saves the state RDD (which is a snapshot of all the state
data) to HDFS using RDD checkpointing. In fact, a streaming app that uses
updateStateByKey will not start until you set a checkpoint directory.
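
To make the "snapshot of all the state data" point concrete, here is a
minimal pure-Python model of the behaviour described above. It is a sketch,
not Spark code: update_state_by_key, run and checkpoint_every are
hypothetical names, a dict stands in for the state RDD, and appending to a
list stands in for the write to HDFS.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, int]
UpdateFn = Callable[[List[int], Optional[int]], Optional[int]]


def update_state_by_key(state: State, batch: List[Tuple[str, int]],
                        update: UpdateFn) -> State:
    """Model one batch: call `update` for every key in the old state or the batch."""
    grouped: Dict[str, List[int]] = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)

    new_state: State = {}
    for key in set(state) | set(grouped):
        # Keys with no new data still get called, with an empty value list.
        result = update(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key from the state
            new_state[key] = result
    return new_state


def running_count(new_values: List[int], old: Optional[int]) -> Optional[int]:
    """Example update function: keep a running sum per key."""
    return (old or 0) + sum(new_values)


def run(batches: List[List[Tuple[str, int]]], checkpoint_every: int) -> List[State]:
    """Process batches in order; return the snapshots that would be saved."""
    state: State = {}
    saved: List[State] = []
    for i, batch in enumerate(batches, start=1):
        state = update_state_by_key(state, batch, running_count)
        if i % checkpoint_every == 0:
            saved.append(dict(state))  # the whole state, not just changed keys
    return saved


batches = [[("a", 1), ("b", 1)], [("a", 1)], [("c", 1)], [("b", 2)]]
snapshots = run(batches, checkpoint_every=2)
print(snapshots)
```

In this model the second saved snapshot contains "a" and "c" even though the
last batch only touched "b": each periodic save holds every key, not a delta.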


2. The performance of updateStateByKey is largely independent of which
source is being used - receiver-based or direct Kafka. The absolute
performance obviously depends on a LOT of variables: size of the
cluster, parallelization, etc. The key thing is that you must ensure
sufficient parallelization at every stage - receiving, shuffles
(updateStateByKey included), and output.
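
As a rough illustration of why the partition count of the shuffle matters,
the sketch below counts how many state keys land in each partition, assuming
keys are spread by hash modulo the number of partitions. It is a pure-Python
model: partition_loads is a hypothetical helper, and zlib.crc32 is only a
stand-in for whatever hash function the real partitioner uses.

```python
import zlib
from collections import Counter
from typing import List


def partition_loads(keys: List[str], num_partitions: int) -> List[int]:
    """Return the number of keys assigned to each partition by hash % num_partitions."""
    counts = Counter(
        zlib.crc32(key.encode("utf-8")) % num_partitions for key in keys
    )
    return [counts.get(p, 0) for p in range(num_partitions)]


# Hypothetical workload: 1000 session keys.
keys = [f"user-{i}" for i in range(1000)]

loads_1 = partition_loads(keys, 1)  # everything in a single partition
loads_8 = partition_loads(keys, 8)  # the same keys spread over 8 partitions

print("1 partition :", loads_1)
print("8 partitions:", loads_8)
```

With one partition a single task has to update every key in each batch; with
more partitions the same keys are divided among tasks that can run in
parallel, provided the cluster has the cores to run them.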

Some more discussion in my talk -
https://www.youtube.com/watch?v=d5UJonrruHk


On Tue, Jul 14, 2015 at 4:11 PM, swetha <[email protected]> wrote:

>
> Hi TD,
>
> I have a question regarding sessionization using updateStateByKey. If near
> real time state needs to be maintained in a Streaming application, what
> happens when the number of RDDs to maintain the state becomes very large?
> Does it automatically get saved to HDFS and reload when needed or do I have
> to use any code like ssc.checkpoint(checkpointDir)?  Also, how is the
> performance if I use both DStream Checkpointing for maintaining the state
> and use Kafka Direct approach for exactly once semantics?
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Does-RDD-checkpointing-store-the-entire-state-in-HDFS-tp7368p13227.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
