Hi all,

Xiangrui and I talked with a heavy Apache Spark user last week about their
experience integrating machine learning (and deep learning) frameworks with
Spark, and some of their pain points. A couple of things were obvious, and I
wanted to share what we learned with the list.

(1) Most organizations already use Spark for data plumbing and want to be
able to run the ML part of their stack on Spark as well (not necessarily by
re-implementing all the algorithms, but by integrating frameworks like
TensorFlow and MXNet with Spark).

(2) The integration, however, is painful from a systems perspective:


   - Performance: data exchange between Spark and other frameworks is slow,
   because UDFs that cross process boundaries (into native code) are slow.
   This works much better now with Pandas UDFs, given that a lot of the ML/DL
   frameworks are in Python (see the first sketch after this list). However,
   there is likely still some low-hanging fruit here.


   - Fault tolerance and execution model: Spark assumes fine-grained task
   recovery, i.e., if something fails, only that task is rerun. This doesn’t
   match the execution model of distributed ML/DL frameworks, which are
   typically MPI-based: rerunning a single task would leave the entire
   system hanging, so the whole stage needs to be re-run (see the second
   sketch after this list).


   - Accelerator-aware scheduling: The DL frameworks leverage GPUs and
   sometimes FPGAs as accelerators for speedup, and Spark’s scheduler isn’t
   aware of those resources, leading to either over-utilizing the accelerators
   or under-utilizing the CPUs.
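
To make the performance point concrete, here is a minimal sketch of the
Pandas UDF path (PySpark 2.3+, requires PyArrow): a vectorized UDF receives
each batch of rows as a pandas.Series over Arrow, so a Python ML library can
score whole batches instead of paying per-row serialization cost. The column
names and the score() function below are made up for illustration; the
arithmetic stands in for a real model call.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 1000).withColumnRenamed("id", "feature")

    # Vectorized (Arrow-backed) scalar UDF: `feature` arrives as a
    # pandas.Series per batch, not one row at a time.
    @pandas_udf("double")
    def score(feature):
        # Placeholder for model.predict(feature); any Python ML library
        # that accepts a pandas/NumPy batch would slot in here.
        return feature * 0.5

    df.withColumn("prediction", score("feature")).show(5)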


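For the execution-model point, here is a toy illustration (plain Python
threads, no Spark or MPI involved; all names are made up) of why rerunning a
single failed task doesn't help: workers that synchronize on every step
cannot make progress once a peer disappears, so the whole group has to be
restarted together.

    import threading

    NUM_WORKERS = 4
    STEPS = 3
    barrier = threading.Barrier(NUM_WORKERS)

    def worker(rank, fail_rank=None):
        try:
            for step in range(STEPS):
                if rank == fail_rank and step == 1:
                    print(f"worker {rank}: crashing at step {step}")
                    barrier.abort()  # simulate the task dying
                    return
                # allreduce-style sync: nobody proceeds until everyone arrives
                barrier.wait(timeout=5)
                print(f"worker {rank}: finished step {step}")
        except threading.BrokenBarrierError:
            print(f"worker {rank}: stuck; restarting only the failed peer "
                  "from step 0 would leave the rest of us waiting forever")

    def run(fail_rank=None):
        threads = [threading.Thread(target=worker, args=(r, fail_rank))
                   for r in range(NUM_WORKERS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    run()             # healthy run: all workers advance in lockstep
    run(fail_rank=2)  # one "task" fails: the survivors cannot continue alone
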
The good thing is that none of these seem very difficult to address (and we
have already made progress on one of them). Xiangrui has graciously accepted
the challenge of coming up with solutions and SPIPs for these.

Xiangrui - please also chime in if I didn’t capture everything.
