[
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507478#comment-16507478
]
Wangda Tan commented on SPARK-24374:
------------------------------------
[~mengxr],
Thanks for the explanations. The use cases look good to me, and I'm looking
forward to seeing more progress on this work :).
For MPI on YARN, there are two parallel efforts. One is mpich2-yarn by
[~clarkyzl]; I'm not sure whether it has a design doc. The other is one we
built when I was at Pivotal: openmpi on YARN (hamster). Unfortunately, the
latter was never open sourced. Here's a high-level design of that work:
[https://www.slideshare.net/hadoop/the-zoo-expands?qid=b2efbd75-97af-4f71-9add-abf84970eaef&v=&b=&from_search=1]
If there are any requirements on YARN for the scheduling part, please let us
know and we will see how we can help.
> SPIP: Support Barrier Scheduling in Apache Spark
> ------------------------------------------------
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
> Issue Type: Epic
> Components: ML, Spark Core
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Priority: Major
> Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users
> can properly embed distributed DL training as a Spark stage to simplify the
> distributed training workflow. For example, Horovod uses MPI to implement
> all-reduce to accelerate distributed TensorFlow training. The computation
> model is different from MapReduce used by Spark. In Spark, a task in a stage
> doesn’t depend on any other tasks in the same stage, and hence it can be
> scheduled independently. In MPI, all workers start at the same time and pass
> messages around. To embed this workload in Spark, we need to introduce a new
> scheduling model, tentatively named “barrier scheduling”, which launches
> tasks at the same time and provides users enough information and tooling to
> embed distributed DL training. Spark can also provide an extra layer of fault
> tolerance in case some tasks failed in the middle, where Spark would abort
> all tasks and restart the stage.
> {quote}
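The scheduling semantics described in the quoted proposal (launch all tasks
together, and abort and restart the whole stage if any task fails) can be
illustrated with a small, self-contained Python sketch. It does not use Spark,
YARN, or MPI: threads stand in for tasks, and the names run_barrier_stage and
flaky_task are hypothetical, chosen only for this illustration.

```python
import threading


def run_barrier_stage(task_fn, num_tasks, max_attempts=3):
    """Run num_tasks copies of task_fn together; retry the whole stage on failure.

    Illustrative only: threads stand in for Spark tasks, and this is not
    Spark's scheduler or API.
    """
    for attempt in range(max_attempts):
        barrier = threading.Barrier(num_tasks)
        results = [None] * num_tasks
        errors = []

        def worker(rank):
            try:
                results[rank] = task_fn(rank, barrier, attempt)
            except Exception as exc:
                errors.append(exc)
                # Break the barrier so peers waiting on it fail fast
                # instead of blocking forever.
                barrier.abort()

        threads = [threading.Thread(target=worker, args=(rank,))
                   for rank in range(num_tasks)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        if not errors:
            return results
        # At least one task failed: discard partial results and rerun all tasks.

    raise RuntimeError("stage failed after %d attempts" % max_attempts)


def flaky_task(rank, barrier, attempt):
    """Hypothetical task: rank 1 fails on the first attempt only."""
    if attempt == 0 and rank == 1:
        raise ValueError("simulated task failure")
    # Every task waits here until all of its peers have arrived.
    barrier.wait(timeout=5)
    return rank * 2


if __name__ == "__main__":
    print(run_barrier_stage(flaky_task, num_tasks=4))
```

On the first attempt the failure of one task breaks the barrier, so every
other task in the stage is aborted as well; the second attempt reruns all four
tasks from scratch rather than only the one that failed.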