1. When you set ssc.checkpoint(checkpointDir), Spark Streaming periodically saves the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing. In fact, a streaming app with updateStateByKey will not start until you set a checkpoint directory.
2. updateStateByKey performance is largely independent of which source is being used, receiver-based or direct Kafka. The absolute performance obviously depends on a LOT of variables: the size of the cluster, parallelization, etc. The key thing is that you must ensure sufficient parallelization at every stage: receiving, shuffles (updateStateByKey included), and output. Some more discussion in my talk - https://www.youtube.com/watch?v=d5UJonrruHk

On Tue, Jul 14, 2015 at 4:11 PM, swetha <swethakasire...@gmail.com> wrote:

> Hi TD,
>
> I have a question regarding sessionization using updateStateByKey. If near
> real time state needs to be maintained in a Streaming application, what
> happens when the number of RDDs to maintain the state becomes very large?
> Does it automatically get saved to HDFS and reload when needed or do I have
> to use any code like ssc.checkpoint(checkpointDir)? Also, how is the
> performance if I use both DStream Checkpointing for maintaining the state
> and use Kafka Direct approach for exactly once semantics?
>
> Thanks,
> Swetha
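
To make points 1 and 2 of the reply concrete, here is a minimal PySpark sketch, not taken from the thread. It assumes a text socket source that emits one session id per line, a 10-second batch interval, and a per-session event count as the state; the names update_event_count and build_context, the host, port, and partition count are all hypothetical. It shows where the checkpoint directory is set and how numPartitions controls the parallelism of the updateStateByKey shuffle.

```python
def update_event_count(new_values, running_count):
    """Fold one batch's values for a key into that key's running total.

    Spark passes None as the previous state the first time a key is seen.
    """
    return sum(new_values) + (running_count or 0)


def build_context(checkpoint_dir, host="localhost", port=9999, num_partitions=8):
    # pyspark is imported here so that update_event_count above can be
    # used and tested without a Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="SessionEventCounts")
    ssc = StreamingContext(sc, 10)  # 10-second batches (assumed)

    # Needed by updateStateByKey: the state RDD is periodically
    # checkpointed to this directory (typically an HDFS path).
    ssc.checkpoint(checkpoint_dir)

    # Assumed input: one session id per line on a text socket.
    lines = ssc.socketTextStream(host, port)
    events = lines.map(lambda session_id: (session_id, 1))

    # numPartitions sets the parallelism of the stateful shuffle.
    counts = events.updateStateByKey(update_event_count,
                                     numPartitions=num_partitions)
    counts.pprint()
    return ssc
```

To run it, call build_context with a checkpoint path of your choice, then ssc.start() and ssc.awaitTermination(). The right value for num_partitions depends on the cluster and the number of keys, so treat 8 as a placeholder rather than a recommendation.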