I'm new to the Beam project, so ignore me if I say things that were already discussed, but I'd like to give you my two cents from the last few months I've spent working on this and scouting different solutions for ML on a distributed processing engine, so that you have another perspective on the subject.
The feeling, shared with many other ML engineers, is that the open source big data ecosystem keeps reinventing the wheel: the same algorithms are rewritten for every platform, and they fall short because their quality is not comparable with proper ML-oriented solutions. SparkML was born as a placeholder and evolved into a decent library only because interest in Spark skyrocketed. It wasn't Spark's main focus, it wasn't the best tool for proper ML, and it existed to serve a "batteries included" approach that is obviously not enough for most applications. The same is true for FlinkML, which is at a very early stage right now.

On the other side, many, many libraries and platforms were born to achieve the same results and appeal to the same audience (big enterprises with existing infrastructure that want an *easy* way to do ML at scale): Mahout was one of the first in the Apache foundation, and many others followed. Most of these libraries were bound to a processing engine (MapReduce or, later, Spark) and were really hard to port. A good approach, and an interesting one for Beam, is that of Samsara, a distributed matrix operations library: its algorithms are expressed in terms of "simple" primitives that are directly implemented on Spark, Flink, MapReduce, and so on. The others suffer from portability issues and require a long integration effort. Another approach is that of H2O: they built their own cloud, with their own KV storage and communication protocol. They probably didn't mean to be integrated with other platforms in the first place, but they released Sparkling Water, which basically builds an H2O cluster inside Spark by instantiating their clients inside Spark's executors. This is not a simple piece of software, and most of the complexity comes from translating back and forth between Spark and H2O data structures.
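To make the Samsara-style idea concrete, here's a rough sketch (plain Java, no real distributed backend) of what I mean by writing an algorithm once against "simple" primitives. All the names here (DistributedMatrix, LocalMatrix, etc.) are illustrative, not Samsara's actual API; a real backend would translate the primitives into Spark/Flink/Beam operations instead of running them in memory.

```java
// Hypothetical sketch: an algorithm written only against a small primitive
// interface, with one swappable backend implementation.
interface DistributedMatrix {
    DistributedMatrix times(DistributedMatrix other);   // matrix multiply
    DistributedMatrix transpose();
    double get(int row, int col);
    int rows();
    int cols();
}

// Trivial single-machine backend; an engine-specific backend would implement
// the same interface with distributed operations.
class LocalMatrix implements DistributedMatrix {
    private final double[][] data;
    LocalMatrix(double[][] data) { this.data = data; }

    public int rows() { return data.length; }
    public int cols() { return data[0].length; }
    public double get(int r, int c) { return data[r][c]; }

    public DistributedMatrix times(DistributedMatrix other) {
        double[][] out = new double[rows()][other.cols()];
        for (int i = 0; i < rows(); i++)
            for (int j = 0; j < other.cols(); j++)
                for (int k = 0; k < cols(); k++)
                    out[i][j] += get(i, k) * other.get(k, j);
        return new LocalMatrix(out);
    }

    public DistributedMatrix transpose() {
        double[][] out = new double[cols()][rows()];
        for (int i = 0; i < rows(); i++)
            for (int j = 0; j < cols(); j++)
                out[j][i] = get(i, j);
        return new LocalMatrix(out);
    }
}

public class Gramian {
    // The "algorithm": X^T * X, the first step of many least-squares style
    // methods. It never mentions a backend, only the primitives.
    static DistributedMatrix gramian(DistributedMatrix x) {
        return x.transpose().times(x);
    }

    public static void main(String[] args) {
        DistributedMatrix x = new LocalMatrix(new double[][] {{1, 2}, {3, 4}});
        DistributedMatrix g = gramian(x);
        // X^T X = [[10, 14], [14, 20]]
        System.out.println(g.get(0, 0) + " " + g.get(0, 1) + " "
                + g.get(1, 0) + " " + g.get(1, 1));
    }
}
```

The point is that porting to a new engine means reimplementing the handful of primitives, not every algorithm.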
So, this is all to say that I see a lot of partial solutions to the problem: platforms trying to do the work of ML libraries and engines, and ML libraries and engines trying to integrate with existing distributed processing software. Both try to fill the gap by doing another piece of software's job instead of doing what they do best. I have huge expectations for an ML library/DSL built over Beam, because it has the potential to achieve the separation that a clean and rational big data ecosystem requires. It should offer enough primitives (linear algebra, optimization algorithms, data structures) and tooling to let people contribute their own algorithms to Beam before native ML libraries like SparkML do. As I said before, I believe integrating with native libraries would be a big, big error: it would be really hard to sell in an enterprise environment, it doesn't really give added value to the user, and it would probably be a pain to find a unifying model across different libraries.

2016-04-21 1:30 GMT+02:00 Davor Bonaci <[email protected]>:

> It seems like there's a lot of community interest in ML running on Beam --
> definitely something that we should eventually have in Beam.
>
> Hopefully, we'll be able to coordinate individual efforts to come up with a
> unified API. It fits right in with Beam goals to have a library of ML
> PTransforms that isn't tied to any particular ML backend. Then, users will
> have portability benefits and will be able to make the right choice for
> them for each execution.
>
> Overall, I think this is a complex feature with a really big impact and
> benefit to Beam. As such, it would be great to write up and discuss
> architecture and design in detail first.
>
> --
>
> In terms of specific questions, a library of PTransforms would probably be
> a better start than a DSL (but that doesn't exclude the possibility of a
> DSL some day). There would be a default implementation, and then each
> runner could override it, as appropriate.
>
> I think Simone's warning should be taken into account, however. Definitely
> something to have in mind as the design progresses.
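For what it's worth, the "default implementation that each runner can override" idea can be sketched in plain Java without any Beam dependency. Everything here (TransformRegistry, the registration by name) is my own illustrative strawman, not Beam's actual override mechanism:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the pattern: a transform ships a portable default
// implementation, and a runner may register a native override for the same
// transform under its name. Names are illustrative, not Beam API.
public class TransformRegistry {
    private final Map<String, Function<double[], Double>> overrides = new HashMap<>();

    // Portable default implementation: a plain mean over the input values.
    static double defaultMean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // A runner with a faster native mean registers it here.
    public void registerOverride(String name, Function<double[], Double> impl) {
        overrides.put(name, impl);
    }

    // Run the override if one is registered, otherwise the default.
    public double apply(String name, double[] values) {
        Function<double[], Double> impl =
                overrides.getOrDefault(name, TransformRegistry::defaultMean);
        return impl.apply(values);
    }

    public static void main(String[] args) {
        TransformRegistry registry = new TransformRegistry();
        double[] data = {1.0, 2.0, 3.0, 4.0};

        // No override registered: the default implementation runs.
        System.out.println(registry.apply("Mean", data)); // prints 2.5

        // A runner plugs in its own implementation of the same transform;
        // the user-facing pipeline does not change.
        registry.registerOverride("Mean",
                vs -> java.util.Arrays.stream(vs).average().orElse(0));
        System.out.println(registry.apply("Mean", data)); // prints 2.5
    }
}
```

The user composes pipelines against one transform name; which implementation executes is the runner's choice, which is exactly the separation I'd like an ML library on Beam to exploit.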
