I think Kay might be able to give a better answer. The most recent benchmark I remember had the number at somewhere between 8.6ms and 14.6ms, depending on the Spark version (https://github.com/apache/spark/pull/2030#issuecomment-52715181). Another point to note is that this is the total time to run a null job, so it includes scheduling, task launch, the time to send results back to the driver, etc.
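In case it helps anyone reproduce the rough shape of that number, here is a minimal (untested) sketch of timing a null job from the driver. It is not the benchmark from the PR above, and the warm-up, iteration, and partition counts are arbitrary, but it measures the same end-to-end path -- scheduling + task launch + returning results:

    import org.apache.spark.{SparkConf, SparkContext}

    object NullJobLatency {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("null-job-latency"))

        val numTasks = 1      // one trivial task per job
        val iterations = 100  // average over many runs

        // Warm up executors / JIT before timing.
        (1 to 10).foreach(_ => sc.parallelize(1 to numTasks, numTasks).count())

        // count() forces the job to run and the result to come back to the
        // driver, so the measured time includes scheduling, task launch, and
        // result return -- the same components as the null-job number above.
        val start = System.nanoTime()
        (1 to iterations).foreach(_ => sc.parallelize(1 to numTasks, numTasks).count())
        val avgMs = (System.nanoTime() - start) / 1e6 / iterations
        println(f"average end-to-end time per null job: $avgMs%.2f ms")

        sc.stop()
      }
    }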
Shivaram

On Fri, Nov 7, 2014 at 9:23 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
> Hmm, relevant quote from section 3.3:
>
>> newer frameworks like Spark [35] reduce the overhead to 5ms. To support
>> tasks that complete in hundreds of milliseconds, we argue for reducing
>> task launch overhead even further to 1ms so that launch overhead
>> constitutes at most 1% of task runtime. By maintaining an active thread
>> pool for task execution on each worker node and caching binaries, task
>> launch overhead can be reduced to the time to make a remote procedure call
>> to the slave machine to launch the task. Today’s datacenter networks easily
>> allow a RPC to complete within 1ms. In fact, recent work showed that 10μs
>> RPCs are possible in the short term [26]; thus, with careful engineering,
>> we believe task launch overheads of 50μs are attainable. 50μs task
>> launch overheads would enable even smaller tasks that could read data from
>> in-memory or from flash storage in order to complete in milliseconds.
>
> So it looks like I misunderstood the current cost of task initialization.
> It's already as low as 5ms (and not 100ms)?
>
> Nick
>
> On Fri, Nov 7, 2014 at 11:15 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>
>> On Fri, Nov 7, 2014 at 8:04 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> Sounds good. I'm looking forward to tracking improvements in this area.
>>>
>>> Also, just to connect some more dots here, I just remembered that there is
>>> currently an initiative to add an IndexedRDD
>>> <https://issues.apache.org/jira/browse/SPARK-2365> interface. Some
>>> interesting use cases mentioned there include (emphasis added):
>>>
>>>> To address these problems, we propose IndexedRDD, an efficient key-value
>>>> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing
>>>> key uniqueness and pre-indexing the entries for efficient joins and *point
>>>> lookups, updates, and deletions*.
>>>
>>>> GraphX would be the first user of IndexedRDD, since it currently implements
>>>> a limited form of this functionality in VertexRDD. We envision a variety of
>>>> other uses for IndexedRDD, including *streaming updates* to RDDs, *direct
>>>> serving* from RDDs, and as an execution strategy for Spark SQL.
>>>
>>> Maybe some day we'll have Spark clusters directly serving up point lookups
>>> or updates. I imagine the tasks running on clusters like that would be tiny
>>> and would benefit from very low task startup times and scheduling latency.
>>> Am I painting that picture correctly?
>>
>> Yeah - we painted a similar picture in a short paper last year titled
>> "The Case for Tiny Tasks in Compute Clusters":
>> http://shivaram.org/publications/tinytasks-hotos13.pdf
>
>>> Anyway, thanks for explaining the current status of Sparrow.
>>>
>>> Nick
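One small note on the point-lookup picture above: you can already express a point lookup today with PairRDDFunctions.lookup, but it launches a job per lookup, so its latency on small data is dominated by exactly the scheduling and task-launch overheads discussed in this thread -- which is a big part of why IndexedRDD (and lower launch overhead) matter for direct serving. A rough, untested sketch with made-up keys and values, not the IndexedRDD API:

    import org.apache.spark.{SparkConf, SparkContext}

    object PointLookupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("point-lookup-sketch"))

        // A cached key-value RDD standing in for a dataset we'd like to serve.
        val kv = sc.parallelize((1L to 1000000L).map(k => (k, s"value-$k")), 8).cache()
        kv.count() // materialize the cache

        // Each lookup() launches a Spark job over the partitions that may hold
        // the key, so per-lookup latency is roughly scheduling + task launch
        // rather than data-access time.
        val t0 = System.nanoTime()
        val hit = kv.lookup(42L)
        val ms = (System.nanoTime() - t0) / 1e6
        println(f"lookup(42) -> $hit, took $ms%.1f ms")

        sc.stop()
      }
    }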