Besides that, one thing needed by multiple frameworks is the ability to schedule tasks in a single wave,
i.e. if a framework like xgboost/mxnet requires 50 parallel workers, Spark
should provide the capability to ensure that either all 50 tasks run at
once, or the complete application/job quits after some timeout period.

Best,
Nan

On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <r...@databricks.com> wrote:

> I think that's what Xiangrui was referring to. Instead of retrying a
> single task, retry the entire stage, and the entire stage of tasks needs
> to be scheduled all at once.
>
> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>>>> - Fault tolerance and execution model: Spark assumes fine-grained
>>>> task recovery, i.e. if something fails, only that task is rerun. This
>>>> doesn't match the execution model of distributed ML/DL frameworks
>>>> that are typically MPI-based, and rerunning a single task would lead
>>>> to the entire system hanging. A whole stage needs to be re-run.
>>>
>>> This is not only useful for integrating with 3rd-party frameworks, but
>>> also useful for scaling MLlib algorithms. One of my earliest attempts
>>> in Spark MLlib was to implement an All-Reduce primitive (SPARK-1485
>>> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up
>>> with some compromised solutions. With the new execution model, we can
>>> set up a hybrid cluster and do all-reduce properly.
>>
>> Is there a particular new execution model you are referring to, or do
>> we plan to investigate a new execution model? For the MPI-like model,
>> we also need gang scheduling (i.e. schedule all tasks at once or none
>> of them) and I don't think we have support for that in the scheduler
>> right now.
>>
>>> --
>>> Xiangrui Meng
>>> Software Engineer
>>> Databricks Inc. <http://databricks.com/>
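
For concreteness, here is a minimal sketch of what the gang-scheduled
"single wave" execution discussed above could look like from the user
side. This is only an illustration: the barrier() operator,
BarrierTaskContext, and runDistributedTraining below are hypothetical
names for the proposed behavior, not an existing Spark API.

    // Hypothetical sketch of gang-scheduled ("single wave") execution.
    // barrier(), BarrierTaskContext, and runDistributedTraining are
    // illustrative names, not an existing Spark API.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("gang-demo").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 50, numSlices = 50)

    val result = rdd
      .barrier()                      // ask the scheduler to launch all 50
      .mapPartitions { iter =>        // tasks together, or fail the stage
        val ctx = BarrierTaskContext.get()
        // All 50 workers are now known to be running, so it is safe to
        // wire up an MPI/all-reduce ring across their addresses.
        ctx.barrier()                 // global synchronization point
        runDistributedTraining(iter)  // e.g. one xgboost/mxnet worker
      }
      .collect()

Under this sketch, if any one of the 50 tasks fails, the scheduler would
retry the entire stage rather than the single task, matching the MPI
execution model described in the quoted thread; and if 50 slots never
become free, the timeout would abort the job instead of leaving it
half-scheduled.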