Hi all,

Paul Ogilvie pointed this thread out to me; we overlapped a little at LinkedIn. It’s good to see that this kind of discussion is going on!
I have some thoughts on the discussion:

- Practically speaking, one of the lowest-hanging fruits is the ability for Spark to request GPUs (and, in general, devices). I would be happy to implement this myself, if I were given the go-ahead. I’m familiar only with YARN, though, not with the Mesos or Kubernetes resource schedulers. It would be best to be forward-looking and think about how to request arbitrary Linux devices rather than just GPUs. (A rough sketch of what such a request might look like follows this list.)

- The discussion here regarding ML/DL seems to focus on DL in particular, and the DL discussion seems to focus, vaguely, on data-parallel deep learning training. This is probably a fine starting point.

- It is generally challenging to utilize a GPU fully in each kernel call, but there are solutions like CUDA MPS that virtualize a physical GPU as many smaller GPUs. However, each physical GPU is still represented as a single character device, e.g., /dev/nvidia0. This does not mesh well with YARN’s GPU isolation, which puts each executor in its own cgroup with only specific *physical* character devices whitelisted. Alas. Supporting CUDA MPS would be good to keep in mind for inference workloads; I could elaborate if desired.

- For things like all-reduce to work well, you need to keep your I/O bandwidth in mind, which means being aware of the “topology” of your compute devices (be they CPUs, GPUs, FPGAs, IPUs, or whatever). I’m not sure whether Spark is already aware of this at the Ethernet level (forgive me), but I am certain that it is not aware of it at the PCIe level. Ring all-reduce handles this for you automatically, in some sense, when it creates its “ring”, but only if you give it control of your full topology, which is the traditional MPI style (i.e., with MPI you are normally not sharing a node with other jobs). Additionally, InfiniBand connections let GPUs talk directly to one another via what is called “GPUDirect”, effectively bypassing the CPU and running at the highest bandwidth possible today. This is a very popular approach, and not something that Spark would seemingly be able to touch. So I question Spark’s ability to have a hand in large-scale distributed training of deep learning models.

- I would want to know more about the claims that UDFs are slow. For perspective, PCI Express Gen 3 (Gen 4 is not out yet…) has roughly 12 GB/s of effective bandwidth on an x16 link. Split among 4 GPUs, you have 3 GB/s each. In high-performance computing, this is always considered the bottleneck. (See the back-of-the-envelope sketch after this list.)
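To make the first bullet concrete, here is a rough sketch of what requesting generic devices from the cluster manager might look like from the application side. None of these configuration keys or APIs exist in Spark today; the resource type "gpu", the discovery script path, and the resources() accessor are made up purely for illustration, and "gpu" could just as well be any other Linux character device type:

    import org.apache.spark.{SparkConf, SparkContext, TaskContext}

    // Hypothetical: ask the cluster manager (YARN, in my case) for 2 devices
    // of type "gpu" per executor, plus a script the executor runs to discover
    // which device files (e.g. /dev/nvidia0) it was actually handed.
    val conf = new SparkConf()
      .setAppName("device-request-sketch")
      .set("spark.executor.resource.gpu.amount", "2")
      .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/scripts/findGpus.sh")
    val sc = new SparkContext(conf)

    // Hypothetical: each task can then see which device addresses were assigned
    // to it and pin its kernels to those devices only.
    val assigned = sc.parallelize(0 until 8, 8).map { _ =>
      TaskContext.get().resources()("gpu").addresses.mkString(",")
    }.collect()

The point is only that the request is expressed as (device type, amount) rather than as anything GPU-specific, so FPGAs or other devices could be plugged in the same way.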
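On the bandwidth point, here is a back-of-the-envelope sketch (my own numbers and assumptions, nothing Spark does today) of why the per-GPU PCIe share matters for ring all-reduce. In a ring, each worker moves roughly 2*(n-1)/n of the gradient data across its slowest link per all-reduce:

    // Rough estimate of one ring all-reduce over the slowest link.
    // Assumes link bandwidth is the only bottleneck (no latency, no overlap
    // with compute), which is optimistic.
    def ringAllReduceSeconds(modelBytes: Double, workers: Int, linkGBps: Double): Double = {
      val bytesPerLink = 2.0 * (workers - 1) / workers * modelBytes
      bytesPerLink / (linkGBps * 1e9)
    }

    // 4 GPUs sharing one Gen 3 x16 link (~12 GB/s effective, so ~3 GB/s each),
    // all-reducing a 400 MB model:
    println(ringAllReduceSeconds(400e6, 4, 3.0))  // ~0.2 s per all-reduce step

If you pay that every minibatch, the interconnect dominates quickly, which is why GPUDirect/InfiniBand and topology-aware placement matter so much for distributed training.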
Anyway, this is something I’m particularly interested in. Feel free to poke me if you want me to answer a specific question.

Sincerely,
Daniel

On 2018/05/09 23:31:10, Xiangrui Meng <m...@databricks.com> wrote:
> Shivaram: Yes, we can call it "gang scheduling" or "barrier
> synchronization". Spark doesn't support it now. The proposal is to have
> proper support in Spark's job scheduler, so we can integrate well with
> MPI-like frameworks.
>
> On Tue, May 8, 2018 at 11:17 AM Nan Zhu <zhunanmcg...@gmail.com> wrote:
>
> > .....how I skipped the last part........
> >
> > On Tue, May 8, 2018 at 11:16 AM, Reynold Xin <r...@databricks.com> wrote:
> >
> >> Yes, Nan, totally agree. To be on the same page, that's exactly what I
> >> wrote, wasn't it?
> >>
> >> On Tue, May 8, 2018 at 11:14 AM Nan Zhu <zhunanmcg...@gmail.com> wrote:
> >>
> >>> besides that, one of the things which is needed by multiple frameworks
> >>> is to schedule tasks in a single wave
> >>>
> >>> i.e.
> >>>
> >>> if some frameworks like xgboost/mxnet require 50 parallel workers,
> >>> Spark should provide a capability to ensure that either we run all 50
> >>> tasks at once, or we quit the complete application/job after some
> >>> timeout period
> >>>
> >>> Best,
> >>>
> >>> Nan
> >>>
> >>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin <r...@databricks.com> wrote:
> >>>
> >>>> I think that's what Xiangrui was referring to. Instead of retrying a
> >>>> single task, retry the entire stage, and the entire stage of tasks
> >>>> needs to be scheduled all at once.
> >>>>
> >>>> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
> >>>>
> >>>>>>> - Fault tolerance and execution model: Spark assumes
> >>>>>>> fine-grained task recovery, i.e. if something fails, only that
> >>>>>>> task is rerun. This doesn't match the execution model of
> >>>>>>> distributed ML/DL frameworks that are typically MPI-based, and
> >>>>>>> rerunning a single task would lead to the entire system hanging.
> >>>>>>> A whole stage needs to be re-run.
> >>>>>>
> >>>>>> This is not only useful for integrating with 3rd-party frameworks,
> >>>>>> but also useful for scaling MLlib algorithms. One of my earliest
> >>>>>> attempts in Spark MLlib was to implement an All-Reduce primitive
> >>>>>> (SPARK-1485 <https://issues.apache.org/jira/browse/SPARK-1485>).
> >>>>>> But we ended up with some compromised solutions. With the new
> >>>>>> execution model, we can set up a hybrid cluster and do all-reduce
> >>>>>> properly.
> >>>>>
> >>>>> Is there a particular new execution model you are referring to, or do
> >>>>> we plan to investigate a new execution model? For the MPI-like model,
> >>>>> we also need gang scheduling (i.e. schedule all tasks at once or none
> >>>>> of them) and I don't think we have support for that in the scheduler
> >>>>> right now.
> >>>>>
> >>>>>> --
> >>>>>> Xiangrui Meng
> >>>>>> Software Engineer
> >>>>>> Databricks Inc. <http://databricks.com/>
> >
> > --
> > Xiangrui Meng
> > Software Engineer
> > Databricks Inc. <http://databricks.com/>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org