Hi all, Xiangrui and I were talking with a heavy Apache Spark user last week about their experiences integrating machine learning (and deep learning) frameworks with Spark, and some of their pain points. A couple of things were obvious, and I wanted to share our learnings with the list.
(1) Most organizations already use Spark for data plumbing and want to run the ML part of their stack on Spark as well, not necessarily by re-implementing all the algorithms, but by integrating frameworks like TensorFlow and MXNet with Spark.

(2) The integration is, however, painful from a systems perspective:

- Performance: data exchange between Spark and other frameworks is slow, because UDFs that cross process boundaries (with native code) are slow. This works much better now with Pandas UDFs (given that many ML/DL frameworks are in Python), but there may still be some low-hanging-fruit gaps here.

- Fault tolerance and execution model: Spark assumes fine-grained task recovery, i.e. if something fails, only that task is rerun. This doesn't match the execution model of distributed ML/DL frameworks, which are typically MPI-based; rerunning a single task would leave the entire system hanging. A whole stage needs to be rerun.

- Accelerator-aware scheduling: DL frameworks leverage GPUs and sometimes FPGAs as accelerators for speedup, but Spark's scheduler isn't aware of those resources, leading to either over-utilizing the accelerators or under-utilizing the CPUs.

The good thing is that none of these seem very difficult to address (and we have already made progress on one of them). Xiangrui has graciously accepted the challenge of coming up with solutions and SPIPs for these. Xiangrui - please also chime in if I didn't capture everything.
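To make the performance point concrete: the reason Pandas UDFs help is that the function is called once per Arrow batch with a pandas Series, rather than once per row with Python objects. Here is a minimal pure-pandas sketch of that contract (no Spark cluster needed; the function names and batch size are made up for illustration, not Spark APIs):

```python
import pandas as pd

# A scalar Pandas UDF receives a pd.Series per batch and returns a
# pd.Series of the same length -- one vectorized call per batch instead
# of one Python call per row, which is where the speedup comes from.
def predict_batch(features: pd.Series) -> pd.Series:
    # Stand-in for a model's vectorized predict(); real code would call
    # e.g. a TensorFlow or MXNet model here.
    return features * 2.0

# Simulate how the engine feeds the UDF: split the column into batches
# and apply the function once per batch, then stitch the results together.
def apply_in_batches(col: pd.Series, fn, batch_size: int = 2) -> pd.Series:
    parts = [fn(col.iloc[i:i + batch_size])
             for i in range(0, len(col), batch_size)]
    return pd.concat(parts)

col = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
result = apply_in_batches(col, predict_batch)
# result holds [2.0, 4.0, 6.0, 8.0, 10.0]
```

In actual PySpark code the same shape is expressed with the pandas_udf decorator; the sketch above just shows why the per-batch calling convention avoids the per-row serialization cost that made process-boundary UDFs slow.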