I'm new to the Beam project, so ignore me if I say things that were already discussed, but I'd like to give you my two cents from the last few months I've spent working on this and scouting different solutions for ML on a distributed processing engine, so that you have another perspective on the subject.
The feeling, shared with many other ML engineers, is that the open source big data ecosystem keeps reinventing the wheel: the same algorithms are rewritten for every platform, and they fall short because their quality is not comparable with proper ML-oriented solutions. SparkML was born as a placeholder and evolved into a decent library only because interest in Spark skyrocketed. It wasn't Spark's main focus, it wasn't the best tool for proper ML, and it existed to serve a "batteries included" approach that is obviously not enough for most applications. The same is true for FlinkML, which is at a very early stage right now.

On the other side, many, many libraries and platforms were born to achieve the same results and appeal to the same audience (big enterprises with existing infrastructure that want an *easy* way to do ML at scale): Mahout was one of the first in the Apache foundation, and many others followed. Most of these libraries were bound to a processing engine (MapReduce or, later, Spark) and were really hard to port. A good approach, and an interesting one for Beam, is that of Samsara, a distributed matrix operations library: its algorithms are expressed in terms of "simple" primitives that are directly implemented on Spark, Flink, MapReduce, and so on. The others suffer from portability issues and require a long integration effort. Another approach is that of H2O: they built their own cloud, with their own KV storage and communication protocol. They probably didn't mean to be integrated with other platforms in the first place, but they released Sparkling Water, which basically builds an H2O cluster inside Spark by instantiating their clients inside Spark's executors. This is not a simple piece of software, and most of the complexity comes from translating back and forth between Spark and H2O data structures.
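To make the Samsara-style idea concrete, here's a rough sketch (plain Java, no real distributed backend) of what I mean by writing an algorithm once against "simple" primitives. All the names here (DistributedMatrix, LocalMatrix, etc.) are illustrative, not Samsara's actual API; a real backend would translate the primitives into Spark/Flink/Beam operations instead of running them in memory.

```java
// Hypothetical sketch: an algorithm written only against a small primitive
// interface, with one swappable backend implementation.
interface DistributedMatrix {
    DistributedMatrix times(DistributedMatrix other);   // matrix multiply
    DistributedMatrix transpose();
    double get(int row, int col);
    int rows();
    int cols();
}

// Trivial single-machine backend; an engine-specific backend would implement
// the same interface with distributed operations.
class LocalMatrix implements DistributedMatrix {
    private final double[][] data;
    LocalMatrix(double[][] data) { this.data = data; }

    public int rows() { return data.length; }
    public int cols() { return data[0].length; }
    public double get(int r, int c) { return data[r][c]; }

    public DistributedMatrix times(DistributedMatrix other) {
        double[][] out = new double[rows()][other.cols()];
        for (int i = 0; i < rows(); i++)
            for (int j = 0; j < other.cols(); j++)
                for (int k = 0; k < cols(); k++)
                    out[i][j] += get(i, k) * other.get(k, j);
        return new LocalMatrix(out);
    }

    public DistributedMatrix transpose() {
        double[][] out = new double[cols()][rows()];
        for (int i = 0; i < rows(); i++)
            for (int j = 0; j < cols(); j++)
                out[j][i] = get(i, j);
        return new LocalMatrix(out);
    }
}

public class Gramian {
    // The "algorithm": X^T * X, the first step of many least-squares style
    // methods. It never mentions a backend, only the primitives.
    static DistributedMatrix gramian(DistributedMatrix x) {
        return x.transpose().times(x);
    }

    public static void main(String[] args) {
        DistributedMatrix x = new LocalMatrix(new double[][] {{1, 2}, {3, 4}});
        DistributedMatrix g = gramian(x);
        // X^T X = [[10, 14], [14, 20]]
        System.out.println(g.get(0, 0) + " " + g.get(0, 1) + " "
                + g.get(1, 0) + " " + g.get(1, 1));
    }
}
```

The point is that porting to a new engine means reimplementing the handful of primitives, not every algorithm.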
So, this is all to say that I see a lot of partial solutions to the problem: platforms trying to do the work of ML libraries and engines, and ML libraries and engines trying to integrate with existing distributed processing software. Both try to fill the gap by doing another piece of software's job instead of doing what they do best. I have huge expectations for an ML library/DSL built over Beam, because it has the potential to achieve the separation that a clean and rational big data ecosystem requires. It should offer enough primitives (linear algebra, optimization algorithms, data structures) and tooling to let people contribute their own algorithms to Beam before native ML libraries like SparkML do. As I said before, I believe integrating with native libraries would be a big, big error: it would be really hard to sell in an enterprise environment, it doesn't really give added value to the user, and it would probably be a pain to find a unifying model across different libraries.

2016-04-21 1:30 GMT+02:00 Davor Bonaci <[email protected]>:

> It seems like there's a lot of community interest in ML running on Beam --
> definitely something that we should eventually have in Beam.
>
> Hopefully, we'll be able to coordinate individual efforts to come up with a
> unified API. It fits right in with Beam goals to have a library of ML
> PTransforms that isn't tied to any particular ML backend. Then, users will
> have portability benefits and will be able to make the right choice for
> them for each execution.
>
> Overall, I think this is a complex feature with a really big impact and
> benefit to Beam. As such, it would be great to write up and discuss
> architecture and design in detail first.
>
> --
>
> In terms of specific questions, a library of PTransforms would probably be
> a better start than a DSL (but that doesn't exclude the possibility of a
> DSL some day). There would be a default implementation, and then each
> runner could override it, as appropriate.
>
> I think Simone's warning should be taken into account, however. Definitely
> something to have in mind as the design progresses.
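For what it's worth, the "default implementation that each runner can override" idea can be sketched in plain Java without any Beam dependency. Everything here (TransformRegistry, the registration by name) is my own illustrative strawman, not Beam's actual override mechanism:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the pattern: a transform ships a portable default
// implementation, and a runner may register a native override for the same
// transform under its name. Names are illustrative, not Beam API.
public class TransformRegistry {
    private final Map<String, Function<double[], Double>> overrides = new HashMap<>();

    // Portable default implementation: a plain mean over the input values.
    static double defaultMean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    // A runner with a faster native mean registers it here.
    public void registerOverride(String name, Function<double[], Double> impl) {
        overrides.put(name, impl);
    }

    // Run the override if one is registered, otherwise the default.
    public double apply(String name, double[] values) {
        Function<double[], Double> impl =
                overrides.getOrDefault(name, TransformRegistry::defaultMean);
        return impl.apply(values);
    }

    public static void main(String[] args) {
        TransformRegistry registry = new TransformRegistry();
        double[] data = {1.0, 2.0, 3.0, 4.0};

        // No override registered: the default implementation runs.
        System.out.println(registry.apply("Mean", data)); // prints 2.5

        // A runner plugs in its own implementation of the same transform;
        // the user-facing pipeline does not change.
        registry.registerOverride("Mean",
                vs -> java.util.Arrays.stream(vs).average().orElse(0));
        System.out.println(registry.apply("Mean", data)); // prints 2.5
    }
}
```

The user composes pipelines against one transform name; which implementation executes is the runner's choice, which is exactly the separation I'd like an ML library on Beam to exploit.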
