I think Kay might be able to give a better answer. The most recent benchmark I remember had the number at somewhere between 8.6ms and 14.6ms, depending on the Spark version (https://github.com/apache/spark/pull/2030#issuecomment-52715181). Another point to note is that this is the total time to run a null job, so it includes scheduling, task launch, the time to send results back to the driver, etc.
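In case it helps anyone reproduce the rough shape of that number, here is a minimal (untested) sketch of timing a null job from the driver. It is not the benchmark from the PR above, and the warm-up, iteration, and partition counts are arbitrary, but it measures the same end-to-end path -- scheduling + task launch + returning results:

    import org.apache.spark.{SparkConf, SparkContext}

    object NullJobLatency {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("null-job-latency"))

        val numTasks = 1      // one trivial task per job
        val iterations = 100  // average over many runs

        // Warm up executors / JIT before timing.
        (1 to 10).foreach(_ => sc.parallelize(1 to numTasks, numTasks).count())

        // count() forces the job to run and the result to come back to the
        // driver, so the measured time includes scheduling, task launch, and
        // result return -- the same components as the null-job number above.
        val start = System.nanoTime()
        (1 to iterations).foreach(_ => sc.parallelize(1 to numTasks, numTasks).count())
        val avgMs = (System.nanoTime() - start) / 1e6 / iterations
        println(f"average end-to-end time per null job: $avgMs%.2f ms")

        sc.stop()
      }
    }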
Shivaram

On Fri, Nov 7, 2014 at 9:23 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Hmm, relevant quote from section 3.3:
>
>> newer frameworks like Spark [35] reduce the overhead to 5ms. To support
>> tasks that complete in hundreds of milliseconds, we argue for reducing
>> task launch overhead even further to 1ms so that launch overhead
>> constitutes at most 1% of task runtime. By maintaining an active thread
>> pool for task execution on each worker node and caching binaries, task
>> launch overhead can be reduced to the time to make a remote procedure call
>> to the slave machine to launch the task. Today’s datacenter networks easily
>> allow a RPC to complete within 1ms. In fact, recent work showed that 10μs
>> RPCs are possible in the short term [26]; thus, with careful engineering,
>> we believe task launch overheads of 50μs are attainable. 50μs task
>> launch overheads would enable even smaller tasks that could read data from
>> in-memory or from flash storage in order to complete in milliseconds.
>
> So it looks like I misunderstood the current cost of task initialization.
> It's already as low as 5ms (and not 100ms)?
>
> Nick
>
> On Fri, Nov 7, 2014 at 11:15 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>
>> On Fri, Nov 7, 2014 at 8:04 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> Sounds good. I'm looking forward to tracking improvements in this area.
>>>
>>> Also, just to connect some more dots here, I just remembered that there is
>>> currently an initiative to add an IndexedRDD
>>> <https://issues.apache.org/jira/browse/SPARK-2365> interface. Some
>>> interesting use cases mentioned there include (emphasis added):
>>>
>>>> To address these problems, we propose IndexedRDD, an efficient key-value
>>>> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing
>>>> key uniqueness and pre-indexing the entries for efficient joins and *point
>>>> lookups, updates, and deletions*.
>>>
>>>> GraphX would be the first user of IndexedRDD, since it currently implements
>>>> a limited form of this functionality in VertexRDD. We envision a variety of
>>>> other uses for IndexedRDD, including *streaming updates* to RDDs, *direct
>>>> serving* from RDDs, and as an execution strategy for Spark SQL.
>>>
>>> Maybe some day we'll have Spark clusters directly serving up point lookups
>>> or updates. I imagine the tasks running on clusters like that would be tiny
>>> and would benefit from very low task startup times and scheduling latency.
>>> Am I painting that picture correctly?
>>
>> Yeah - we painted a similar picture in a short paper last year titled
>> "The Case for Tiny Tasks in Compute Clusters":
>> http://shivaram.org/publications/tinytasks-hotos13.pdf
>
>>> Anyway, thanks for explaining the current status of Sparrow.
>>>
>>> Nick
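One small note on the point-lookup picture above: you can already express a point lookup today with PairRDDFunctions.lookup, but it launches a job per lookup, so its latency on small data is dominated by exactly the scheduling and task-launch overheads discussed in this thread -- which is a big part of why IndexedRDD (and lower launch overhead) matter for direct serving. A rough, untested sketch with made-up keys and values, not the IndexedRDD API:

    import org.apache.spark.{SparkConf, SparkContext}

    object PointLookupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("point-lookup-sketch"))

        // A cached key-value RDD standing in for a dataset we'd like to serve.
        val kv = sc.parallelize((1L to 1000000L).map(k => (k, s"value-$k")), 8).cache()
        kv.count() // materialize the cache

        // Each lookup() launches a Spark job over the partitions that may hold
        // the key, so per-lookup latency is roughly scheduling + task launch
        // rather than data-access time.
        val t0 = System.nanoTime()
        val hit = kv.lookup(42L)
        val ms = (System.nanoTime() - t0) / 1e6
        println(f"lookup(42) -> $hit, took $ms%.1f ms")

        sc.stop()
      }
    }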