Let me chime in on the discussion as well. Spark Streaming is another use case where the scheduler's task-launching throughput and per-task latency can limit the batch interval, and hence the overall latencies achievable by Spark Streaming. Let's say we want to do batches of 20 ms (to achieve end-to-end latencies < 50 ms) with 100 receivers. If each receiver chunks received data into 10 ms blocks (blocks must be formed at least as frequently as the batch interval), and a task is launched to process each block, that means 100 blocks/second x 100 receivers = 10,000 tasks per second. This creates a scalability vs. latency tradeoff: if your limit is 1000 tasks per second (simplifying from 1500), you could either configure 100 receivers with 100 ms batches (10 blocks/sec per receiver), or 1000 receivers with 1 second batches.
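To make that arithmetic concrete, here is a quick back-of-the-envelope sketch you can paste into the Scala REPL (taskLaunchRate is just an illustrative name for the calculation above, not a Spark API):

  // Tasks/sec the scheduler must launch: one task per block per receiver.
  def taskLaunchRate(numReceivers: Int, blockIntervalMs: Int): Double =
    numReceivers * (1000.0 / blockIntervalMs)

  taskLaunchRate(100, 10)     // 10000.0 -> 100 receivers, 10 ms blocks
  taskLaunchRate(100, 100)    //  1000.0 -> 100 receivers, 100 ms blocks/batches
  taskLaunchRate(1000, 1000)  //  1000.0 -> 1000 receivers, 1 s batches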
Note that I did this calculation without considering the processing power of the worker nodes, task closure size, workload characteristics, etc., or other bottlenecks in the system. It therefore does not reflect the current performance limits of Spark Streaming, which are most likely much lower than the above due to other overheads. However, as we optimize things further, I know that the scheduler is going to be one of the biggest hurdles, and only outside-the-box approaches like Sparrow can solve it.

TD

On Sat, Nov 8, 2014 at 10:26 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> However, I haven't seen it be as high as the 100 ms Michael quoted (maybe
>> this was for jobs with tasks that have much larger objects that take a
>> long time to deserialize?).
>>
>
> I was thinking more about the average end-to-end latency for launching a
> query that has 100s of partitions. It's also quite possible that SQL's task
> launch overhead is higher, since we have never profiled how much is getting
> pulled into the closures.