One main reason why Spark Streaming can achieve higher throughput than Storm is that Spark Streaming operates on coarser-grained batches - large, second-scale batches - which amortize per-tuple overheads in shuffles and other kinds of data movement.
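To make the amortization argument concrete, here is a toy back-of-the-envelope model (plain Python, not Spark or Storm code; the overhead and per-tuple costs are made-up constants purely for illustration):

```python
# Toy model of why batching raises throughput: a fixed per-operation
# cost (serialization, network round trip, task scheduling) is paid
# once per tuple in a per-tuple system, but only once per batch in a
# micro-batch system. All numbers are assumptions for illustration.

PER_OP_OVERHEAD_MS = 1.0   # assumed fixed cost per send/shuffle operation
PER_TUPLE_WORK_MS = 0.01   # assumed actual processing cost per tuple

def total_time_ms(num_tuples, batch_size):
    """Time to process num_tuples when tuples are moved in batches."""
    num_ops = -(-num_tuples // batch_size)  # ceil division: one op per batch
    return num_ops * PER_OP_OVERHEAD_MS + num_tuples * PER_TUPLE_WORK_MS

per_tuple = total_time_ms(100_000, 1)         # per-tuple style: one op each
micro_batch = total_time_ms(100_000, 10_000)  # micro-batch style: ten ops

print(f"per-tuple:   {per_tuple:.0f} ms")    # overhead dominates
print(f"micro-batch: {micro_batch:.0f} ms")  # overhead amortized away
```

Under these (made-up) constants the per-tuple run spends almost all of its time on fixed overhead, while the batched run pays it only ten times - the same qualitative effect, if not the same magnitude, as what we see in practice.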
Note that this increased throughput does not come for free: larger batches mean larger end-to-end latency. Storm may give lower end-to-end latency than Spark Streaming (which has second-scale latency with second-scale batches). However, we have observed that for a large variety of streaming use cases, people are often okay with second-scale latencies, but find it much harder to work around at-least-once semantics (double-counting, etc.) and the lack of built-in state management (state kept locally in a worker can get lost if the worker dies). Plus, Spark Streaming has the major advantage of a simpler, higher-level API than Storm, and the whole Spark ecosystem around it (Spark SQL, MLlib, etc.) that it can use, which makes writing streaming analytics applications very easy. Regarding Trident, we have heard from many developers that Trident gives lower throughput than Storm due to its transactional guarantees. It's hard to say what causes the performance penalty without doing a very detailed head-to-head analysis.

TD

On Sun, May 4, 2014 at 5:11 PM, Chris Fregly <ch...@fregly.com> wrote:

> great questions, weide. in addition, i'd also like to hear more about how
> to horizontally scale a spark-streaming cluster.
>
> i've gone through the samples (standalone mode) and read the
> documentation, but it's still not clear to me how to scale this puppy out
> under high load. i assume i add more receivers (kinesis, flume, etc), but
> physically how does this work?
>
> @TD: can you comment?
>
> thanks!
>
> -chris
>
>
> On Sun, May 4, 2014 at 2:10 PM, Weide Zhang <weo...@gmail.com> wrote:
>
>> Hi,
>>
>> It might be a very general question to ask here, but I'm curious to know
>> why spark streaming can achieve better throughput than storm, as claimed in
>> the spark streaming paper. Does it depend on certain use cases and/or data
>> sources? What drives better performance in the spark streaming case, or
>> put another way, what makes storm not as performant as spark streaming?
>>
>> Also, in order to guarantee exactly-once semantics when node failure
>> happens, spark makes replicas of RDDs and checkpoints so that data can be
>> recomputed on the fly, while in the Trident case they use transactional
>> objects to persist the state and results, but it's not obvious to me which
>> approach is more costly and why? Can anyone provide some experience here?
>>
>> Thanks a lot,
>>
>> Weide
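To make the double-counting problem mentioned above concrete, here is a toy model in plain Python (not Storm, Trident, or Spark code; the tuple IDs and delivery sequence are invented for illustration). Under at-least-once delivery, a failure causes the source to redeliver tuples, and a naive counter counts them twice; remembering which tuple IDs were already processed - which is, in spirit, what a transactional/exactly-once layer provides - avoids this:

```python
# Toy model of double-counting under at-least-once delivery.
# A simulated failure causes tuples 3 and 4 to be redelivered.

def count_at_least_once(deliveries):
    """Naive counter: every delivered tuple increments the count."""
    return len(deliveries)

def count_exactly_once(deliveries):
    """Deduplicating counter: each tuple ID is counted at most once."""
    return len(set(deliveries))

# 5 distinct tuples; 3 and 4 are redelivered after the simulated failure.
deliveries = [1, 2, 3, 4, 3, 4, 5]

print(count_at_least_once(deliveries))  # 7 - over-counts the redeliveries
print(count_exactly_once(deliveries))   # 5 - correct
```

The dedup set here stands in for whatever durable state the real systems keep (batch IDs in Trident's transactional state, deterministically recomputed batches in Spark Streaming); the point is only where the over-count comes from, not how either system implements the fix.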