Hi, I am relatively new to Spark and am using the updateStateByKey() operation to maintain state in my Spark Streaming application. The input data comes from a Kafka topic.
1. How are DStreams partitioned?
2. How does partitioning work with mapWithState() or updateStateByKey()?
3. In updateStateByKey(), are the old state and the new values for a given key processed on the same node?
4. How frequent is the shuffle for updateStateByKey()? The state I have to maintain contains ~100,000 keys, and I want to avoid a shuffle every time I update the state. Any tips on how to do that?

Warm Regards
Soumitra
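For context, here is a minimal sketch of the kind of update function I mean, assuming PySpark's updateStateByKey() and a simple per-key event count; the names and the counting logic are just illustrative, not my actual job:

```python
def update_func(new_values, last_state):
    """Called once per key per batch by updateStateByKey():
    merges this batch's values into the running count for that key.
    Illustrative only -- assumes values are 1s emitted per event."""
    return sum(new_values) + (last_state or 0)

# In the streaming job it would be wired up roughly like
# (kafka_stream is hypothetical):
#   counts = kafka_stream.map(lambda rec: (rec[0], 1)) \
#                        .updateStateByKey(update_func)
```

The question is whether calling this for ~100,000 keys forces a full shuffle of the state RDD on every batch, or whether the state stays co-partitioned with the incoming keys.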