Fundamentally, stream processing systems are designed for processing
streams of data, not for storing large volumes of data for a long period of
time. So if you have to maintain that much state for months, then it's best
to use another system that is designed for long-term storage (like
Cassandra), which has proper support for making all that state
fault-tolerant, highly performant, etc. So yes, the best option is to use
Cassandra for the state, with the Spark Streaming jobs accessing the state
from Cassandra. There are a number of optimizations that can be done. It's
not too hard to build a simple on-demand populated cache (a singleton hash
map, for example) that speeds up access from Cassandra, with all updates
written through the cache. This is a common use of Spark Streaming +
Cassandra/HBase.
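
For illustration, a minimal sketch (in Scala) of such a singleton,
on-demand populated, write-through cache. The fetchFromCassandra and
writeToCassandra helpers below are placeholder stubs standing in for
whatever Cassandra client you use, not a real API:

  import java.util.concurrent.ConcurrentHashMap

  object StateCache {
    // One map per executor JVM; it survives across micro-batches.
    private val cache = new ConcurrentHashMap[String, List[(String, String)]]()

    // Populate on demand: only keys actually touched in the current
    // batch are read from Cassandra, not all keys on every slide.
    def get(user: String): List[(String, String)] = {
      val cached = cache.get(user)
      if (cached != null) cached
      else {
        val fromDb = fetchFromCassandra(user)
        val prev = cache.putIfAbsent(user, fromDb)
        if (prev != null) prev else fromDb
      }
    }

    // Write-through: update the cache and Cassandra together, so the
    // cache never serves stale state after a successful write.
    def update(user: String, events: List[(String, String)]): Unit = {
      cache.put(user, events)
      writeToCassandra(user, events)
    }

    // Stubs standing in for real Cassandra I/O (assumed helpers).
    private def fetchFromCassandra(user: String): List[(String, String)] = Nil
    private def writeToCassandra(user: String, events: List[(String, String)]): Unit = ()
  }

You would then drive it from the stream with plain
foreachRDD/foreachPartition, for example (assuming stream is a
DStream[(String, (String, String))] of (user, (activity, time)) pairs):

  stream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      partition.foreach { case (user, event) =>
        StateCache.update(user, event :: StateCache.get(user))
      }
    }
  }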

Regarding the performance of updateStateByKey, we are aware of the
limitations, and we will improve it soon :)

TD


On Tue, Apr 14, 2015 at 12:34 PM, Krzysztof Zarzycki <k.zarzy...@gmail.com>
wrote:

> Hey guys, could you please help me with a question I asked on
> Stackoverflow:
> https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two
> ?  I'll be really grateful for your help!
>
> I'm also pasting the question below:
>
> I'm trying to solve a (simplified here) problem in Spark Streaming: Let's
> say I have a log of events made by users, where each event is a tuple (user
> name, activity, time), e.g.:
>
> ("user1", "view", "2015-04-14T21:04Z") ("user1", "click",
> "2015-04-14T21:05Z")
>
> Now I would like to gather the events by user to do some analysis on
> them. Let's say the output is some analysis of:
>
> ("user1", List(("view", "2015-04-14T21:04Z"),("click",
> "2015-04-14T21:05Z"))
>
> The events should be kept for up to *2 months*. During that time there
> might be around *500 million* such events, and *millions of unique* users,
> which are the keys here.
>
> *My questions are:*
>
>    - Is it feasible to do such a thing with updateStateByKey on a
>    DStream, when I have millions of keys stored? (A rough sketch of what
>    I mean follows below.)
>    - Am I right that DStream.window is of no use here, when I have a
>    2-month-long window and would like a slide of a few seconds?
>
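> For reference, a rough sketch (in Scala) of what I mean by keeping the
> state with updateStateByKey. It is simplified, and assumes events is a
> DStream[(String, (String, String))] of (user, (activity, time)) pairs,
> with checkpointing enabled:
>
>   val byUser = events.updateStateByKey[List[(String, String)]] {
>     (newEvents: Seq[(String, String)], state: Option[List[(String, String)]]) =>
>       // Append this batch's events to the user's accumulated history.
>       // Spark invokes this function for every known key on every batch,
>       // even for keys with no new events, which is exactly my concern.
>       Some(newEvents.toList ++ state.getOrElse(Nil))
>   }
>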
> P.S. I found out that updateStateByKey is called on all the keys on every
> slide, which means it will be called millions of times every few seconds.
> That makes me doubt this design, and I'm rather thinking about
> alternative solutions like:
>
>    - using Cassandra for state
>    - using Trident state (with Cassandra probably)
>    - using Samza with its state management.
>
>
