This question is conflating a few different concepts. I think the main question is whether Spark will have a shuffle implementation that streams data rather than persisting it to disk/cache as a buffer. Spark currently decouples the shuffle write from the read, using disk/OS cache as a buffer. The two benefits of this approach are that it allows intra-query fault tolerance and it makes it easier to elastically scale and reschedule work within a job. We consider these to be design requirements (think about jobs that run for several hours on hundreds of machines). Impala, and similar systems like Dremel and F1, do not offer fault tolerance within a query at present. They also require gang scheduling the entire set of resources that will be used for the duration of a query.
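To make the decoupling concrete, here is a minimal toy sketch (not Spark code; all names are hypothetical) of a pull-based shuffle: map tasks partition their output into a store standing in for local disk / OS cache, and reduce tasks only pull from that store afterwards. Because the map output is persisted, a failed reduce task can re-fetch it without re-running the map stage.

```python
from collections import defaultdict

def pull_shuffle(records, map_fn, reduce_fn, num_reducers):
    # Map stage: each record's map output is written, partitioned by
    # reducer id, into a buffer standing in for local disk / OS cache.
    shuffle_store = defaultdict(list)  # reducer_id -> list of (key, value)
    for record in records:
        for key, value in map_fn(record):
            shuffle_store[hash(key) % num_reducers].append((key, value))

    # Reduce stage: runs only after every map output is persisted
    # (the "hard barrier"). A lost reduce task could simply re-read
    # shuffle_store instead of re-executing the map stage.
    results = {}
    for rid in range(num_reducers):
        grouped = defaultdict(list)
        for key, value in shuffle_store[rid]:
            grouped[key].append(value)
        for key, values in grouped.items():
            results[key] = reduce_fn(values)
    return results

# Example: word count over two "input splits".
counts = pull_shuffle(
    records=["a b", "b"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=sum,
    num_reducers=2,
)
# counts == {"a": 1, "b": 2}
```

A push/streaming design would instead send map output straight to running reducers, which is what removes the buffer (and with it, this cheap recovery path).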
A secondary question is whether our shuffle should have a barrier or not. Spark's shuffle currently has a hard barrier between the map and reduce stages. We haven't seen really strong evidence that removing the barrier is a net win. It can help the performance of a single job (modestly), but in a multi-tenant workload it leads to poor utilization, since you have a lot of reduce tasks taking up slots while waiting for mappers to finish. Many large-scale users of Map/Reduce disable this feature in production clusters for that reason. Thus, we haven't seen compelling evidence for removing the barrier at this point, given the complexity of doing so. It is possible that future versions of Spark will support push-based shuffles, potentially in a mode that removes some of Spark's fault tolerance properties. But there are many other things we can still optimize about the shuffle that would likely come before this.

- Patrick

On Wed, Jan 7, 2015 at 6:01 PM, 曹雪林 <xuelincao2...@gmail.com> wrote:
> Hi,
>
> I've heard a lot of complaints about Spark's "pull" style shuffle. Is
> there any plan to support a "push" style shuffle in the near future?
>
> Currently, the shuffle phase must be completed before the next stage
> starts. Meanwhile, it is said that in Impala, the shuffled data is "streamed"
> to the next stage handler, which greatly saves time. Will Spark support
> this mechanism one day?
>
> Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org