RE: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-05-14 Thread Haopu Wang
Hi TD, regarding to the performance of updateStateByKey, do you have a JIRA for that so we can watch it? Thank you! From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 8:09 AM To: Krzysztof Zarzycki Cc: user Subject: Re: Is it

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-15 Thread Tathagata Das
Can you clarify more on what you want to do after querying? Is the batch not completed until the querying and subsequent processing has completed? On Tue, Apr 14, 2015 at 10:36 PM, Krzysztof Zarzycki k.zarzy...@gmail.com wrote: Thank you Tathagata, very helpful answer. Though, I would like

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-14 Thread Tathagata Das
Fundamentally, stream processing systems are designed for processing streams of data, not for storing large volumes of data for a long period of time. So if you have to maintain that much state for months, then its best to use another system that is designed for long term storage (like Cassandra)

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-14 Thread Krzysztof Zarzycki
Thank you Tathagata, very helpful answer. Though, I would like to highlight that recent stream processing systems are trying to help users in implementing use case of holding such large (like 2 months of data) states. I would mention here Samza state management

Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-14 Thread Krzysztof Zarzycki
Hey guys, could you please help me with a question I asked on Stackoverflow: https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two ? I'll be really grateful for your help! I'm also pasting the question below: I'm trying to