BTW, this is more of a user-list kind of mail than a dev-list one. The
dev-list is for Spark developers.

On Tue, Jul 14, 2015 at 4:23 PM, Tathagata Das <t...@databricks.com> wrote:

> 1. When you set ssc.checkpoint(checkpointDir), Spark Streaming
> periodically saves the state RDD (which is a snapshot of all the state
> data) to HDFS using RDD checkpointing. In fact, a streaming app with
> updateStateByKey will not start until you set a checkpoint directory.
>
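[To make point 1 concrete: a Spark-free Python sketch of the updateStateByKey contract. The function and variable names here are made up for illustration; in a real Spark Streaming app you would also have to call ssc.checkpoint(checkpointDir) before starting, as described above.]

```python
from collections import defaultdict

# Spark-free simulation of updateStateByKey semantics: every batch,
# Spark calls the update function once per key, passing that key's new
# values from the batch and its previous state (None the first time).
def update_fn(new_values, prev_state):
    # running count per key, as in the classic stateful word count
    return (prev_state or 0) + sum(new_values)

def apply_batch(state, batch):
    # group the batch's (key, value) pairs by key
    grouped = defaultdict(list)
    for key, value in batch:
        grouped[key].append(value)
    # apply the update function per key against the previous state
    new_state = dict(state)
    for key, values in grouped.items():
        new_state[key] = update_fn(values, state.get(key))
    return new_state

state = {}
state = apply_batch(state, [("a", 1), ("b", 1), ("a", 1)])
state = apply_batch(state, [("a", 1)])
# state == {"a": 3, "b": 1}
```

The whole `state` dict stands in for the state RDD that gets snapshotted to HDFS on checkpointing; this is why the state is written in full rather than as a delta.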
> 2. The updateStateByKey performance is largely independent of the
> source being used - receiver-based or direct Kafka. The absolute
> performance obviously depends on a LOT of variables: size of the
> cluster, parallelization, etc. The key thing is that you must ensure
> sufficient parallelization at every stage - receiving, shuffles
> (updateStateByKey included), and output.
>
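[The parallelization point can also be illustrated without a cluster: state is hash-partitioned by key, so the partition count caps how many tasks can update state concurrently in a batch. A hypothetical sketch; Spark's actual HashPartitioner uses Java's hashCode, and crc32 below is just a deterministic stand-in.]

```python
import zlib

def partition_of(key, num_partitions):
    # deterministic stand-in for Spark's HashPartitioner:
    # hash the key, then take it modulo the partition count
    return zlib.crc32(key.encode()) % num_partitions

keys = [f"session-{i}" for i in range(1000)]
loads = {}
for n in (4, 32):
    counts = [0] * n
    for k in keys:
        counts[partition_of(k, n)] += 1
    loads[n] = counts
# every key lands in exactly one partition either way; with 32
# partitions each task updates far fewer keys per batch than with 4
```

This is why too few partitions on the state DStream bottlenecks the updateStateByKey shuffle no matter how large the cluster is.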
> Some more discussion in my talk -
> https://www.youtube.com/watch?v=d5UJonrruHk
>
>
> On Tue, Jul 14, 2015 at 4:11 PM, swetha <swethakasire...@gmail.com> wrote:
>
>>
>> Hi TD,
>>
>> I have a question regarding sessionization using updateStateByKey. If
>> near-real-time state needs to be maintained in a Streaming application,
>> what happens when the number of RDDs maintaining the state becomes very
>> large? Does the state automatically get saved to HDFS and reloaded when
>> needed, or do I have to use code like ssc.checkpoint(checkpointDir)?
>> Also, how is the performance if I use both DStream checkpointing for
>> maintaining the state and the Kafka direct approach for exactly-once
>> semantics?
>>
>>
>> Thanks,
>> Swetha
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Does-RDD-checkpointing-store-the-entire-state-in-HDFS-tp7368p13227.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
