Let me chime in on the discussion as well. Spark Streaming is another
use case where the scheduler's task-launching throughput and task
latency can limit the batch interval, and hence the overall latencies,
achievable by Spark Streaming. Let's say we want to do batches of 10 ms
(to achieve end-to-end latencies < 50 ms) with 100 receivers. If each
receiver chunks received data into 10 ms blocks (required for making
batches every 10 ms) and then launches a task per block to process it,
that means 100 blocks/second x 100 receivers = 10,000 tasks per second.
This creates a scalability vs. latency tradeoff: if your limit is
1,000 tasks per second (simplifying from 1,500), you could either
configure 100 receivers at 100 ms batches (10 blocks/sec per receiver)
or 1,000 receivers at 1 second batches.
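To make that arithmetic concrete, here is a back-of-the-envelope
sketch in plain Scala (no Spark needed; the numbers are just the
illustrative ones above, not measured limits):

    object SchedulerBudget {
      // Tasks launched per second if every block becomes one task.
      def tasksPerSecond(numReceivers: Int, blockIntervalMs: Int): Double =
        numReceivers * (1000.0 / blockIntervalMs)

      // Smallest block interval (ms) that stays within a task-launch budget.
      def minBlockIntervalMs(numReceivers: Int, tasksPerSecBudget: Int): Double =
        numReceivers * 1000.0 / tasksPerSecBudget

      def main(args: Array[String]): Unit = {
        println(tasksPerSecond(100, 10))        // 100 receivers, 10 ms blocks -> 10000.0 tasks/sec
        println(minBlockIntervalMs(100, 1000))  // 1,000 tasks/sec budget      -> 100.0 ms blocks/batches
        println(minBlockIntervalMs(1000, 1000)) // 1,000 receivers, same budget -> 1000.0 ms (1 s batches)
      }
    }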

Note that I did the calculation without considering the processing
power of the worker nodes, task closure size, workload
characteristics, or other bottlenecks in the system. It therefore does
not reflect the current performance limits of Spark Streaming, which
are most likely much lower than the above due to those other
overheads. However, as we optimize things further, I know the
scheduler is going to be one of the biggest hurdles, and only
outside-the-box approaches like Sparrow can solve it.
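
For what it's worth, a crude way to get a feel for per-task overhead
on a given cluster is to time a job made entirely of no-op tasks and
divide by the task count. This is only a sketch (it assumes an
existing SparkContext `sc` and lumps scheduling, serialization, and
network together), so treat the result as an upper bound rather than a
clean scheduler measurement:

    def approxPerTaskOverheadMs(sc: org.apache.spark.SparkContext,
                                numTasks: Int = 10000): Double = {
      // Warm up the JVM and executors first.
      sc.parallelize(1 to numTasks, numTasks).count()
      val start = System.nanoTime()
      // One element per partition, so each task does essentially no work.
      sc.parallelize(1 to numTasks, numTasks).count()
      val elapsedMs = (System.nanoTime() - start) / 1e6
      elapsedMs / numTasks
    }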

TD

On Sat, Nov 8, 2014 at 10:26 AM, Michael Armbrust
<mich...@databricks.com> wrote:
>>
>> However, I haven't seen it be as
>> high as the 100ms Michael quoted (maybe this was for jobs with tasks that
>> have much larger objects that take a long time to deserialize?).
>>
>
> I was thinking more about the average end-to-end latency for launching a
> query that has 100s of partitions. It's also quite possible that SQL's task
> launch overhead is higher, since we have never profiled how much is getting
> pulled into the closures.
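
As a rough aside on the closure question: a quick-and-dirty way to
eyeball how much state a closure drags along, without any Spark
machinery, is to serialize it with plain Java serialization, which is
roughly what the default closure serializer does. The example below is
purely hypothetical and not SQL's actual closures:

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}

    object ClosureSizeSketch {
      // Serialized size of an object, using plain Java serialization.
      def serializedSize(obj: AnyRef): Int = {
        val bytes = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bytes)
        out.writeObject(obj)
        out.close()
        bytes.size()
      }

      def main(args: Array[String]): Unit = {
        val bigLookupTable = Array.fill[Byte](1 << 20)(0)        // ~1 MB of captured state
        val lean: Int => Int  = _ + 1                            // captures nothing
        val heavy: Int => Int = x => x + bigLookupTable.length   // captures the 1 MB array
        println(s"lean closure:  ${serializedSize(lean)} bytes")
        println(s"heavy closure: ${serializedSize(heavy)} bytes")
      }
    }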
