One main reason why Spark Streaming can achieve higher throughput than Storm is that Spark Streaming operates on coarser-grained batches - large, second-scale batches - which amortize per-tuple overheads in shuffles and other kinds of data movement.
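To make the amortization argument concrete, here is a toy back-of-the-envelope model (plain Python, not Spark or Storm code; the overhead and per-tuple costs are made-up constants purely for illustration):

```python
# Toy model of why batching raises throughput: a fixed per-operation
# cost (serialization, network round trip, task scheduling) is paid
# once per tuple in a per-tuple system, but only once per batch in a
# micro-batch system. All numbers are assumptions for illustration.

PER_OP_OVERHEAD_MS = 1.0   # assumed fixed cost per send/shuffle operation
PER_TUPLE_WORK_MS = 0.01   # assumed actual processing cost per tuple

def total_time_ms(num_tuples, batch_size):
    """Time to process num_tuples when tuples are moved in batches."""
    num_ops = -(-num_tuples // batch_size)  # ceil division: one op per batch
    return num_ops * PER_OP_OVERHEAD_MS + num_tuples * PER_TUPLE_WORK_MS

per_tuple = total_time_ms(100_000, 1)         # per-tuple style: one op each
micro_batch = total_time_ms(100_000, 10_000)  # micro-batch style: ten ops

print(f"per-tuple:   {per_tuple:.0f} ms")    # overhead dominates
print(f"micro-batch: {micro_batch:.0f} ms")  # overhead amortized away
```

Under these (made-up) constants the per-tuple run spends almost all of its time on fixed overhead, while the batched run pays it only ten times - the same qualitative effect, if not the same magnitude, as what we see in practice.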
Note that this increased throughput does not come for free: larger batches mean larger end-to-end latency. Storm may give lower end-to-end latency than Spark Streaming (which has second-scale latency with second-scale batches). However, we have observed that for a large variety of streaming use cases, people are often okay with second-scale latencies, but find it much harder to work around at-least-once semantics (double-counting, etc.) and the lack of built-in state management (state kept locally in a worker can get lost if the worker dies). Plus, Spark Streaming has the major advantage of a simpler, higher-level API than Storm, and the whole Spark ecosystem around it (Spark SQL, MLlib, etc.) that it can use, which makes writing streaming analytics applications very easy. Regarding Trident, we have heard from many developers that Trident gives lower throughput than Storm due to its transactional guarantees. It's hard to say what causes the performance penalty without doing a very detailed head-to-head analysis.

TD

On Sun, May 4, 2014 at 5:11 PM, Chris Fregly <ch...@fregly.com> wrote:

> great questions, weide. in addition, i'd also like to hear more about how
> to horizontally scale a spark-streaming cluster.
>
> i've gone through the samples (standalone mode) and read the
> documentation, but it's still not clear to me how to scale this puppy out
> under high load. i assume i add more receivers (kinesis, flume, etc), but
> physically how does this work?
>
> @TD: can you comment?
>
> thanks!
>
> -chris
>
>
> On Sun, May 4, 2014 at 2:10 PM, Weide Zhang <weo...@gmail.com> wrote:
>
>> Hi,
>>
>> It might be a very general question to ask here, but I'm curious to know
>> why spark streaming can achieve better throughput than storm, as claimed in
>> the spark streaming paper. Does it depend on certain use cases and/or data
>> sources? What drives better performance in the spark streaming case, or
>> put another way, what makes storm not as performant as spark streaming?
>>
>> Also, in order to guarantee exactly-once semantics when node failure
>> happens, spark makes replicas of RDDs and checkpoints so that data can be
>> recomputed on the fly, while in the Trident case they use transactional
>> objects to persist the state and results, but it's not obvious to me which
>> approach is more costly and why? Can anyone provide some experience here?
>>
>> Thanks a lot,
>>
>> Weide
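To make the double-counting problem mentioned above concrete, here is a toy model in plain Python (not Storm, Trident, or Spark code; the tuple IDs and delivery sequence are invented for illustration). Under at-least-once delivery, a failure causes the source to redeliver tuples, and a naive counter counts them twice; remembering which tuple IDs were already processed - which is, in spirit, what a transactional/exactly-once layer provides - avoids this:

```python
# Toy model of double-counting under at-least-once delivery.
# A simulated failure causes tuples 3 and 4 to be redelivered.

def count_at_least_once(deliveries):
    """Naive counter: every delivered tuple increments the count."""
    return len(deliveries)

def count_exactly_once(deliveries):
    """Deduplicating counter: each tuple ID is counted at most once."""
    return len(set(deliveries))

# 5 distinct tuples; 3 and 4 are redelivered after the simulated failure.
deliveries = [1, 2, 3, 4, 3, 4, 5]

print(count_at_least_once(deliveries))  # 7 - over-counts the redeliveries
print(count_exactly_once(deliveries))   # 5 - correct
```

The dedup set here stands in for whatever durable state the real systems keep (batch IDs in Trident's transactional state, deterministically recomputed batches in Spark Streaming); the point is only where the over-count comes from, not how either system implements the fix.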