I am a committer on the MXNet project and very interested in working on the integration with Spark. I am wondering how training would proceed in two cases:

1) Training on one host with multiple GPUs -- I am not sure whether Spark's capabilities can be leveraged here.

2) Distributed training with data parallelism -- how can we fit distributed training onto Spark's map-reduce model, given that the execution model here is iterative in nature? (Roughly along the lines of the sketch below.)
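For (2), here is a strawman of the pattern I have in mind with today's RDD API: broadcast the current weights, compute a gradient per partition, and average the gradients with treeAggregate inside a driver-side loop. This is only a sketch; localGradient is a stand-in for the framework-specific step (an MXNet forward/backward pass in a real integration) and just computes a squared-loss gradient here so the example runs on its own:

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  object DataParallelSketch {
    type Example = (Array[Double], Double) // (features, label)

    // Stand-in for the framework-specific step; in a real integration
    // this would invoke MXNet on the executor.
    def localGradient(w: Array[Double], batch: Iterator[Example]): (Array[Double], Long) = {
      val grad = Array.fill(w.length)(0.0)
      var n = 0L
      batch.foreach { case (x, y) =>
        val err = (x, w).zipped.map(_ * _).sum - y // squared-loss residual
        var i = 0
        while (i < grad.length) { grad(i) += err * x(i); i += 1 }
        n += 1
      }
      (grad, n)
    }

    def train(sc: SparkContext, data: RDD[Example],
              init: Array[Double], iters: Int, lr: Double): Array[Double] = {
      var weights = init
      for (_ <- 0 until iters) {
        val bcW = sc.broadcast(weights) // ship current weights to executors
        // One gradient per partition, then a tree-shaped sum back to the
        // driver (treeAggregate avoids a single-reducer bottleneck).
        val (gradSum, count) = data
          .mapPartitions(it => Iterator.single(localGradient(bcW.value, it)))
          .treeAggregate((Array.fill(init.length)(0.0), 0L))(
            seqOp  = { case ((a, n), (g, m)) => ((a, g).zipped.map(_ + _), n + m) },
            combOp = { case ((a, n), (b, m)) => ((a, b).zipped.map(_ + _), n + m) })
        weights = (weights, gradSum).zipped.map((w, g) => w - lr * g / math.max(count, 1L))
        bcW.destroy()
      }
      weights
    }
  }

The catch, as discussed below, is that every iteration is a separate stage: that fits Spark's fine-grained fault tolerance, but it keeps the framework out of the inner loop, and an MPI-style all-reduce between peers does not fit this model at all.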
Please let me know.

Thanks,
Naveen

On Tue, May 8, 2018 at 8:53 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

>>> - Fault tolerance and execution model: Spark assumes fine-grained
>>> task recovery, i.e. if something fails, only that task is rerun. This
>>> doesn't match the execution model of distributed ML/DL frameworks,
>>> which are typically MPI-based, and rerunning a single task would lead
>>> to the entire system hanging. A whole stage needs to be re-run.
>>
>> This is not only useful for integrating with 3rd-party frameworks, but
>> also useful for scaling MLlib algorithms. One of my earliest attempts in
>> Spark MLlib was to implement an All-Reduce primitive (SPARK-1485
>> <https://issues.apache.org/jira/browse/SPARK-1485>), but we ended up
>> with some compromised solutions. With the new execution model, we can
>> set up a hybrid cluster and do all-reduce properly.
>
> Is there a particular new execution model you are referring to, or do we
> plan to investigate a new execution model? For the MPI-like model, we
> also need gang scheduling (i.e. schedule all tasks at once or none of
> them), and I don't think we have support for that in the scheduler right
> now.
>
>> --
>> Xiangrui Meng
>> Software Engineer
>> Databricks Inc. <http://databricks.com/>
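P.S. To make the gang-scheduling point in the quoted discussion concrete, below is the rough shape of API I would hope the new execution model provides: a stage whose tasks are all scheduled together (or not at all) and can synchronize, so an MPI-style framework can rendezvous inside it. The barrier()/BarrierTaskContext names are hypothetical, just my reading of what the proposal might look like:

  import org.apache.spark.BarrierTaskContext
  import org.apache.spark.sql.SparkSession

  object GangScheduledSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("barrier-sketch").getOrCreate()
      val data = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

      // All tasks of this stage would be launched together (gang
      // scheduled) or not at all, so every peer is up before we rendezvous.
      val out = data.barrier().mapPartitions { partition =>
        val ctx = BarrierTaskContext.get()
        ctx.barrier() // global sync point: every task has reached this line
        val rank = ctx.partitionId() // use as the MPI-style rank
        // ... hand `rank` and peer addresses (ctx.getTaskInfos()) to the
        // DL framework here, e.g. to bootstrap a ring all-reduce ...
        partition
      }
      out.count()
      spark.stop()
    }
  }

With a primitive like this, the failure behavior mentioned in the quoted SPIP text follows naturally: if any task in the barrier stage fails, the whole stage is re-run rather than a single task.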