>> - Fault tolerance and execution model: Spark assumes fine-grained task
>> recovery, i.e. if something fails, only that task is rerun. This doesn't
>> match the execution model of distributed ML/DL frameworks, which are
>> typically MPI-based, and rerunning a single task would leave the entire
>> system hanging. A whole stage needs to be re-run.

> This is not only useful for integrating with 3rd-party frameworks, but
> also useful for scaling MLlib algorithms. One of my earliest attempts in
> Spark MLlib was to implement an All-Reduce primitive (SPARK-1485
> <https://issues.apache.org/jira/browse/SPARK-1485>), but we ended up with
> some compromised solutions. With the new execution model, we can set up a
> hybrid cluster and do all-reduce properly.

Is there a particular new execution model you are referring to, or do we
plan to investigate a new execution model? For the MPI-like model, we also
need gang scheduling (i.e. schedule all tasks at once or none of them), and
I don't think we have support for that in the scheduler right now.
> --
> Xiangrui Meng
> Software Engineer
> Databricks Inc. <http://databricks.com/>
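To make the failure mode concrete, here is a toy sketch (not Spark or MPI API; `simulate_all_reduce` and its workers are illustrative names) of why an MPI-style all-reduce clashes with per-task recovery: every worker blocks at a collective step, so if any single task dies and is restarted in isolation, the surviving workers wait forever. This is also why the tasks must be gang-scheduled as all-or-nothing.

```python
# Toy all-reduce: each "worker" contributes a partial value and every
# worker receives the combined result. The Barrier models the collective
# step in MPI-style frameworks: it only releases once ALL participants
# arrive, so losing one worker hangs the rest of the gang.
import threading
from functools import reduce

def simulate_all_reduce(partials, op):
    n = len(partials)
    barrier = threading.Barrier(n)   # collective step: all-or-nothing
    results = [None] * n

    def worker(rank):
        barrier.wait()               # hangs forever if any worker is missing
        results[rank] = reduce(op, partials)  # every worker gets the full result

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(simulate_all_reduce([1, 2, 3, 4], lambda a, b: a + b))  # [10, 10, 10, 10]
```

If one of the four threads never reached `barrier.wait()`, the other three would block indefinitely; hence rerunning only the failed task cannot recover, and the whole stage has to be restarted together.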