Hi all,

Paul Ogilvie pointed this thread out to me; we overlapped a little at LinkedIn. It’s good to see that this kind of discussion is going on!
I have some thoughts on the discussion:

- Practically speaking, one of the lowest-hanging fruits is the ability for Spark to request GPUs (and, in general, devices). I would be happy to implement this myself, if I were given the go-ahead. I’m familiar only with YARN, though, not with the Mesos or Kubernetes resource schedulers. It would be best to be forward-looking and think about how to request arbitrary Linux devices rather than just GPUs. (A rough sketch of what such a request might look like follows this list.)

- The discussion here regarding ML/DL seems to focus on DL in particular, and the DL discussion seems to focus, vaguely, on data-parallel deep learning training. This is probably a fine starting point.

- It is generally challenging to utilize a GPU fully in each kernel call, but there are solutions like CUDA MPS that virtualize a physical GPU as many smaller GPUs. However, each physical GPU is still represented as a single character device, e.g., /dev/nvidia0. This does not mesh well with YARN’s GPU isolation, which puts each executor in its own cgroup with only specific *physical* character devices whitelisted. Alas. Supporting CUDA MPS would be good to keep in mind for inference workloads; I could elaborate if desired.

- For things like all-reduce to work well, you need to keep your I/O bandwidth in mind, which means being aware of the “topology” of your compute devices (be they CPUs, GPUs, FPGAs, IPUs, or whatever). I’m not sure whether Spark is already aware of this at the Ethernet level (forgive me), but I am certain that it is not aware of it at the PCIe level. Ring all-reduce handles this for you automatically, in some sense, when it creates its “ring”, but only if you give it control of your full topology, which is the traditional MPI style (i.e., with MPI you are normally not sharing a node with other jobs). Additionally, InfiniBand connections let GPUs talk directly to one another via what is called “GPUDirect”, effectively bypassing the CPU and running at the highest bandwidth possible today. This is a very popular approach, and not something that Spark would seemingly be able to touch. So I question Spark’s ability to have a hand in large-scale distributed training of deep learning models.

- I would want to know more about the claims that UDFs are slow. For perspective, PCI Express Gen 3 (Gen 4 is not out yet…) has roughly 12 GB/s of effective bandwidth on an x16 link. Split among 4 GPUs, you have 3 GB/s each. In high-performance computing, this is always considered the bottleneck. (See the back-of-the-envelope sketch after this list.)
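To make the first bullet concrete, here is a rough sketch of what requesting generic devices from the cluster manager might look like from the application side. None of these configuration keys or APIs exist in Spark today; the resource type "gpu", the discovery script path, and the resources() accessor are made up purely for illustration, and "gpu" could just as well be any other Linux character device type:

    import org.apache.spark.{SparkConf, SparkContext, TaskContext}

    // Hypothetical: ask the cluster manager (YARN, in my case) for 2 devices
    // of type "gpu" per executor, plus a script the executor runs to discover
    // which device files (e.g. /dev/nvidia0) it was actually handed.
    val conf = new SparkConf()
      .setAppName("device-request-sketch")
      .set("spark.executor.resource.gpu.amount", "2")
      .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/scripts/findGpus.sh")
    val sc = new SparkContext(conf)

    // Hypothetical: each task can then see which device addresses were assigned
    // to it and pin its kernels to those devices only.
    val assigned = sc.parallelize(0 until 8, 8).map { _ =>
      TaskContext.get().resources()("gpu").addresses.mkString(",")
    }.collect()

The point is only that the request is expressed as (device type, amount) rather than as anything GPU-specific, so FPGAs or other devices could be plugged in the same way.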
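On the bandwidth point, here is a back-of-the-envelope sketch (my own numbers and assumptions, nothing Spark does today) of why the per-GPU PCIe share matters for ring all-reduce. In a ring, each worker moves roughly 2*(n-1)/n of the gradient data across its slowest link per all-reduce:

    // Rough estimate of one ring all-reduce over the slowest link.
    // Assumes link bandwidth is the only bottleneck (no latency, no overlap
    // with compute), which is optimistic.
    def ringAllReduceSeconds(modelBytes: Double, workers: Int, linkGBps: Double): Double = {
      val bytesPerLink = 2.0 * (workers - 1) / workers * modelBytes
      bytesPerLink / (linkGBps * 1e9)
    }

    // 4 GPUs sharing one Gen 3 x16 link (~12 GB/s effective, so ~3 GB/s each),
    // all-reducing a 400 MB model:
    println(ringAllReduceSeconds(400e6, 4, 3.0))  // ~0.2 s per all-reduce step

If you pay that every minibatch, the interconnect dominates quickly, which is why GPUDirect/InfiniBand and topology-aware placement matter so much for distributed training.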
Anyway, this is something I’m particularly interested in. Feel free to poke me if you want me to answer a specific question.

Sincerely,
Daniel

On 2018/05/09 23:31:10, Xiangrui Meng <m...@databricks.com> wrote:
> Shivaram: Yes, we can call it "gang scheduling" or "barrier
> synchronization". Spark doesn't support it now. The proposal is to have
> proper support in Spark's job scheduler, so we can integrate well with
> MPI-like frameworks.
>
> On Tue, May 8, 2018 at 11:17 AM Nan Zhu <zhunanmcg...@gmail.com> wrote:
>
> > .....how I skipped the last part........
> >
> > On Tue, May 8, 2018 at 11:16 AM, Reynold Xin <r...@databricks.com> wrote:
> >
> >> Yes, Nan, totally agree. To be on the same page, that's exactly what I
> >> wrote, wasn't it?
> >>
> >> On Tue, May 8, 2018 at 11:14 AM Nan Zhu <zhunanmcg...@gmail.com> wrote:
> >>
> >>> besides that, one of the things which is needed by multiple frameworks
> >>> is to schedule tasks in a single wave
> >>>
> >>> i.e.
> >>>
> >>> if some frameworks like xgboost/mxnet require 50 parallel workers,
> >>> Spark should provide a capability to ensure that either we run all 50
> >>> tasks at once, or we quit the complete application/job after some
> >>> timeout period
> >>>
> >>> Best,
> >>>
> >>> Nan
> >>>
> >>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <r...@databricks.com> wrote:
> >>>
> >>>> I think that's what Xiangrui was referring to. Instead of retrying a
> >>>> single task, retry the entire stage, and the entire stage of tasks
> >>>> needs to be scheduled all at once.
> >>>>
> >>>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
> >>>>
> >>>>>>> - Fault tolerance and execution model: Spark assumes
> >>>>>>> fine-grained task recovery, i.e. if something fails, only that
> >>>>>>> task is rerun. This doesn't match the execution model of
> >>>>>>> distributed ML/DL frameworks that are typically MPI-based, and
> >>>>>>> rerunning a single task would lead to the entire system hanging.
> >>>>>>> A whole stage needs to be re-run.
> >>>>>>
> >>>>>> This is not only useful for integrating with 3rd-party frameworks,
> >>>>>> but also useful for scaling MLlib algorithms. One of my earliest
> >>>>>> attempts in Spark MLlib was to implement an All-Reduce primitive
> >>>>>> (SPARK-1485 <https://issues.apache.org/jira/browse/SPARK-1485>).
> >>>>>> But we ended up with some compromised solutions. With the new
> >>>>>> execution model, we can set up a hybrid cluster and do all-reduce
> >>>>>> properly.
> >>>>>
> >>>>> Is there a particular new execution model you are referring to, or do
> >>>>> we plan to investigate a new execution model? For the MPI-like model,
> >>>>> we also need gang scheduling (i.e. schedule all tasks at once or none
> >>>>> of them) and I don't think we have support for that in the scheduler
> >>>>> right now.
> >>>>>
> >>>>>> --
> >>>>>> Xiangrui Meng
> >>>>>> Software Engineer
> >>>>>> Databricks Inc. <http://databricks.com/>
> >
> > --
> > Xiangrui Meng
> > Software Engineer
> > Databricks Inc. <http://databricks.com/>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org