On Fri, Nov 7, 2014 at 6:20 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

>> If, for example, you have a cluster of 100 machines, this means the
>> scheduler can launch 150 tasks per machine per second.
>
>
> Did you mean 15 tasks per machine per second here? Or alternatively, 10
> machines?
>
Yes, 15 tasks per machine per second is what I meant -- sorry for the
terrible math there!
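(Spelled out: the cluster-wide total is the same either way -- 150 tasks
per machine per second x 10 machines = 1,500 tasks per second, and 1,500 /
100 machines = 15 tasks per machine per second.)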

>
>> I don't know of any existing Spark clusters that have a large enough
>> number of machines or short enough tasks to justify the added complexity of
>> distributing the scheduler.
>
>
> Actually, this was the reason I took interest in Sparrow--specifically,
> the idea of a Spark cluster handling many very short (<< 50 ms) tasks.
>
> At the recent Spark Committer Night
> <http://www.meetup.com/Spark-NYC/events/209271842/> in NYC, I asked
> Michael if he thought that Spark SQL could eventually completely fill the
> need for very low latency queries currently served by MPP databases like
> Redshift or Vertica. If I recall correctly, he said that the main obstacle
> to that was simply task startup time, which is on the order of 100 ms.
>
> Is there interest in (or perhaps an existing initiative related to)
> improving task startup times to the point where one could legitimately look
> at Spark SQL as a low latency database that can serve many users or
> applications at once? That would probably make a good use case for Sparrow,
> no?
>

Shorter tasks would indeed be a good use case for Sparrow; in fact, they
were the motivation behind the Sparrow work.  When evaluating Sparrow, we
focused on running SQL workloads where tasks were in the 50-100ms range
(detailed in the paper
<http://people.csail.mit.edu/matei/papers/2013/sosp_sparrow.pdf>).

I know Evan, whom I've added here, has been looking at task startup times
in the context of ML workloads; this motivated some recent work (e.g.,
https://issues.apache.org/jira/browse/SPARK-3984) to improve the metrics
shown in the UI that describe task launch overhead.  For the jobs we've
looked at, task startup time was at most tens of milliseconds (I also
remember this being the case when we ran short tasks on Sparrow).
Decreasing it seems like it would be widely beneficial, especially if
there are cases where it's more like 100ms, as Michael alluded to.
Hopefully some of the improved UI reporting will help us understand the
degree to which this is (or is not) an issue.  I'm not sure how much Evan
is attempting to quantify the overhead versus fix it, so I'll let him
chime in here.
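
For anyone who wants a rough read on this themselves, here's a quick
sketch (mine, not from SPARK-3984) for estimating launch overhead from
spark-shell: run a batch of no-op tasks and amortize the wall-clock time
over them.  It assumes an existing SparkContext `sc`, and it understates
the true per-task cost since tasks overlap across cores:

    // Run many tasks that do no work; elapsed time is then dominated
    // by scheduling and task-launch overhead rather than computation.
    val numTasks = 1000
    val start = System.nanoTime()
    sc.parallelize(1 to numTasks, numTasks).foreach(_ => ())
    val elapsedMs = (System.nanoTime() - start) / 1e6
    // Amortized figure only: tasks run in parallel waves, so the real
    // per-task launch cost is higher than this number suggests.
    println(f"total: $elapsedMs%.1f ms, per task: ${elapsedMs / numTasks}%.3f ms")

The per-task metrics that SPARK-3984 adds to the UI are the better tool
for this; the sketch above is just a zero-dependency way to see the order
of magnitude.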


> Nick
>
>
