Besides that, one thing multiple frameworks need is the ability to
schedule tasks in a single wave.

I.e.,

if a framework like xgboost/mxnet requires 50 parallel workers, Spark
should provide the capability to ensure that either all 50 tasks run at
once, or the whole application/job is aborted after some timeout
period.
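
For illustration, here is a rough sketch of what such a "single wave"
stage could look like from the user side. The barrier() call,
BarrierTaskContext, and runXGBoostWorker below are a hypothetical API
shape for this idea, not something Spark supports today; trainingData is
assumed to be an existing RDD:

    // Hypothetical: mark the stage as gang-scheduled, so either all 50
    // tasks launch in one wave or the job fails after a timeout.
    val model = trainingData
      .repartition(50)                     // one partition per worker
      .barrier()                           // hypothetical: all-at-once scheduling
      .mapPartitions { iter =>
        val ctx = BarrierTaskContext.get() // hypothetical per-task handle
        ctx.barrier()                      // block until all 50 workers are up
        runXGBoostWorker(ctx, iter)        // user-supplied; returns an Iterator
      }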

Best,

Nan

On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <r...@databricks.com> wrote:

> I think that's what Xiangrui was referring to. Instead of retrying a
> single task, retry the entire stage, and the entire stage of tasks needs
> to be scheduled all at once.
>
>
> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>>
>>>
>>>>    - Fault tolerance and execution model: Spark assumes fine-grained
>>>>    task recovery, i.e. if something fails, only that task is rerun. This
>>>>    doesn’t match the execution model of distributed ML/DL frameworks that
>>>>    are typically MPI-based, and rerunning a single task would lead to the
>>>>    entire system hanging. A whole stage needs to be re-run.
>>>>
>>> This is not only useful for integrating with 3rd-party frameworks, but
>>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>>> Spark MLlib was to implement an All-Reduce primitive (SPARK-1485
>>> <https://issues.apache.org/jira/browse/SPARK-1485>), but we ended up
>>> with compromise solutions. With the new execution model, we can set
>>> up a hybrid cluster and do all-reduce properly.
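>>>
>>> As a sketch of what that could look like (mapPartitions is real Spark,
>>> but barrier() and allReduce here are hypothetical primitives, and
>>> computeGradient stands in for the per-task work):
>>>
>>>     // Each task computes a partial gradient, then all tasks exchange
>>>     // and sum them directly, with no shuffle or driver round-trip.
>>>     val gradients = data.barrier().mapPartitions { iter =>
>>>       val partial = computeGradient(iter) // hypothetical per-task work
>>>       val total = allReduce(partial)      // hypothetical: blocks until
>>>                                           // every peer has contributed
>>>       Iterator.single(total)
>>>     }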
>>>
>>>
>> Is there a particular new execution model you are referring to, or do we
>> plan to investigate a new execution model? For the MPI-like model, we
>> also need gang scheduling (i.e. schedule all tasks at once or none of
>> them), and I don't think we have support for that in the scheduler right
>> now.
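>>
>> To make that concrete, here is a minimal sketch of the all-or-nothing
>> check gang scheduling implies (imagining code inside Spark's scheduler,
>> where TaskDescription is the internal task descriptor; launchTask is a
>> hypothetical single-task launch hook, not an existing method):
>>
>>     import org.apache.spark.SparkException
>>
>>     // Launch every task in the stage in one wave, or none of them;
>>     // give up on the whole job once the deadline passes.
>>     def tryLaunchGang(tasks: Seq[TaskDescription],
>>                       freeSlots: Int,
>>                       deadlineMs: Long): Boolean = {
>>       if (tasks.size <= freeSlots) {
>>         tasks.foreach(launchTask)        // hypothetical: launch the wave
>>         true
>>       } else if (System.currentTimeMillis() > deadlineMs) {
>>         throw new SparkException("could not gang-schedule stage in time")
>>       } else {
>>         false                            // keep waiting; never launch a
>>                                          // partial wave
>>       }
>>     }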
>>
>>>
>>> --
>>>
>>> Xiangrui Meng
>>>
>>> Software Engineer
>>>
>>> Databricks Inc. <http://databricks.com/>
>>>
>>
>>
