Can you clarify what you want to do after querying? Is the batch not
considered complete until the querying and subsequent processing have completed?


On Tue, Apr 14, 2015 at 10:36 PM, Krzysztof Zarzycki <k.zarzy...@gmail.com>
wrote:

> Thank you Tathagata, very helpful answer.
>
> Though, I would like to highlight that recent stream processing systems
> are trying to help users implement the use case of holding such large
> state (like 2 months of data). I would mention here Samza state management
> <http://samza.apache.org/learn/documentation/0.9/container/state-management.html>
> and Trident state management
> <https://storm.apache.org/documentation/Trident-state>. I'm waiting for
> Spark to support that too, because I generally prefer this technology :)
>
> But considering holding state in Cassandra with Spark Streaming, I
> understand we're not talking here about using Cassandra as input or output
> (nor making use of the spark-cassandra-connector
> <https://github.com/datastax/spark-cassandra-connector>). We're talking
> here about querying Cassandra from map/mapPartitions functions.
> I have one question about it: Is it possible to query Cassandra
> asynchronously within Spark Streaming? And while doing so, is it possible
> to take the next batch of rows while the previous one is waiting on
> Cassandra I/O? I think (but I'm not sure) this generally boils down to
> asking whether several consecutive windows can interleave (because they
> take long to process). Let's draw it:
>
> <------|query Cassandra asynchronously--- > window1
>         <---------------------------------------> window2
>
> While writing this, I'm starting to believe they can, because windows are
> time-triggered, not triggered when the previous window has finished... But
> it's better to ask :)
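>
> To make it concrete, here's a rough sketch of what I mean by asynchronous
> querying from mapPartitions (my assumption only, using the DataStax Java
> driver's executeAsync; CassandraSessionProvider, the user_state table and
> the user field are hypothetical):
>
> stream.mapPartitions { events =>
>   val session = CassandraSessionProvider.get()  // hypothetical per-executor session holder
>   val stmt = session.prepare("SELECT state FROM user_state WHERE user = ?")
>   // Fire a query for every row in the partition without blocking ...
>   val futures = events.map(e => (e, session.executeAsync(stmt.bind(e.user)))).toList
>   // ... then collect the results, so Cassandra I/O overlaps across the partition.
>   futures.iterator.map { case (e, f) => (e, f.getUninterruptibly.one()) }
> }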
>
>
>
>
> 2015-04-15 2:08 GMT+02:00 Tathagata Das <t...@databricks.com>:
>
>> Fundamentally, stream processing systems are designed for processing
>> streams of data, not for storing large volumes of data for a long period
>> of time. So if you have to maintain that much state for months, then it's
>> best to use another system designed for long-term storage (like
>> Cassandra), which has proper support for keeping all that state
>> fault-tolerant, performant, etc. So yes, the best option is to use
>> Cassandra for the state, with Spark Streaming jobs accessing the state
>> from Cassandra. There are a number of optimizations that can be done. It's
>> not too hard to build a simple on-demand populated cache (a singleton hash
>> map, for example) that speeds up access to Cassandra, with all updates
>> written through the cache. This is a common use of Spark Streaming +
>> Cassandra/HBase.
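>>
>> For illustration, a minimal sketch of such a singleton write-through cache
>> (just an assumption of what it could look like; UserState,
>> readFromCassandra and writeToCassandra are hypothetical placeholders):
>>
>> object StateCache {
>>   // One map per executor JVM, populated on demand.
>>   private val cache = new java.util.concurrent.ConcurrentHashMap[String, UserState]()
>>
>>   def get(key: String): UserState = {
>>     val cached = cache.get(key)
>>     if (cached != null) cached
>>     else {
>>       val loaded = readFromCassandra(key)  // cache miss: load from Cassandra
>>       cache.putIfAbsent(key, loaded)
>>       loaded
>>     }
>>   }
>>
>>   def update(key: String, state: UserState): Unit = {
>>     cache.put(key, state)         // update the in-memory copy ...
>>     writeToCassandra(key, state)  // ... and write through to Cassandra
>>   }
>> }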
>>
>> Regarding the performance of updateStateByKey, we are aware of the
>> limitations, and we will improve it soon :)
>>
>> TD
>>
>>
>> On Tue, Apr 14, 2015 at 12:34 PM, Krzysztof Zarzycki <
>> k.zarzy...@gmail.com> wrote:
>>
>>> Hey guys, could you please help me with a question I asked on
>>> Stack Overflow:
>>> https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two
>>> ? I'll be really grateful for your help!
>>>
>>> I'm also pasting the question below:
>>>
>>> I'm trying to solve a problem (simplified here) in Spark Streaming:
>>> Let's say I have a log of events made by users, where each event is a
>>> tuple (user name, activity, time), e.g.:
>>>
>>> ("user1", "view", "2015-04-14T21:04Z") ("user1", "click",
>>> "2015-04-14T21:05Z")
>>>
>>> Now I would like to gather events by user to do some analysis of them.
>>> Let's say that the output is some analysis of:
>>>
>>> ("user1", List(("view", "2015-04-14T21:04Z"), ("click",
>>> "2015-04-14T21:05Z")))
>>>
>>> The events should be kept for up to *2 months*. During that time there
>>> might be around *500 million* such events, and *millions of unique* users,
>>> which are the keys here.
>>>
>>> *My questions are:*
>>>
>>>    - Is it feasible to do such a thing with updateStateByKey on a
>>>    DStream when I have millions of keys stored? (A rough sketch of the
>>>    usage I mean follows this list.)
>>>    - Am I right that DStream.window is of no use here, when I have a
>>>    2-month-long window and would like a slide of a few seconds?
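>>>
>>> To make the first question concrete, the usage I have in mind is roughly
>>> this (a sketch with simplified types, keeping the whole event list as the
>>> state; expiring old events is omitted):
>>>
>>> // events: DStream[(String, (String, String))], i.e. (user, (activity, time))
>>> def updateFunc(newEvents: Seq[(String, String)],
>>>                state: Option[List[(String, String)]]): Option[List[(String, String)]] =
>>>   Some(state.getOrElse(Nil) ++ newEvents)  // append new (activity, time) pairs
>>>
>>> val userHistories = events.updateStateByKey(updateFunc)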
>>>
>>> P.S. I found out that updateStateByKey is called on all the keys on
>>> every slide, which means it will be called millions of times every few
>>> seconds. That makes me doubt this design, and I'm rather thinking about
>>> alternative solutions like:
>>>
>>>    - using Cassandra for state
>>>    - using Trident state (with Cassandra probably)
>>>    - using Samza with its state management.
>>>
>>>
>>
>
