I think that's what Xiangrui was referring to. Instead of retrying a single
task, retry the entire stage, and the entire stage of tasks need to be
scheduled all at once.


On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

>
>>
>>>    - Fault tolerance and execution model: Spark assumes fine-grained
>>>    task recovery, i.e. if something fails, only that task is rerun. This
>>>    doesn’t match the execution model of distributed ML/DL frameworks that 
>>> are
>>>    typically MPI-based, and rerunning a single task would lead to the entire
>>>    system hanging. A whole stage needs to be re-run.
>>>
>>> This is not only useful for integrating with 3rd-party frameworks, but
>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>> Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>> <https://issues.apache.org/jira/browse/SPARK-1485>). But we ended up
>> with some compromised solutions. With the new execution model, we can set
>> up a hybrid cluster and do all-reduce properly.
>>
>>
> Is there a particular new execution model you are referring to or do we
> plan to investigate a new execution model ?  For the MPI-like model, we
> also need gang scheduling (i.e. schedule all tasks at once or none of them)
> and I dont think we have support for that in the scheduler right now.
>
>>
>>> --
>>
>> Xiangrui Meng
>>
>> Software Engineer
>>
>> Databricks Inc. [image: http://databricks.com] <http://databricks.com/>
>>
>
>

Reply via email to