[ https://issues.apache.org/jira/browse/SPARK-24725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735836#comment-16735836 ]
Angel Conde commented on SPARK-24725:
--------------------------------------

Mesos already offers this possibility via its MPI framework ([https://github.com/apache/mesos/tree/master/mpi]). However, that implementation is 5 years old (and seems to be a proof of concept), and I do not know whether that approach would work with Docker images as executors.

Best

> Discuss necessary info and access in barrier mode + Mesos
> ---------------------------------------------------------
>
>                 Key: SPARK-24725
>                 URL: https://issues.apache.org/jira/browse/SPARK-24725
>             Project: Spark
>          Issue Type: Story
>          Components: ML, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Priority: Major
>
> In barrier mode, to run hybrid distributed DL training jobs, we need to
> provide users sufficient info and access so they can set up a hybrid
> distributed training job, e.g., using MPI.
> This ticket limits the scope of discussion to Spark + Mesos. I'm not aware of
> MPI support in Mesos. So we should find someone with good knowledge to lead
> the discussion here.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)