Re: Comparing scikit-learn, Mahout Samsara and SystemML

Trevor Grant Tue, 06 Jun 2017 13:41:43 -0700

Hey Gustavo, et al.

First off- great topic and thank you for moving it here!

Secondly, Matthias- awesome response- helped me too.

Looping in Mahout-dev, as I think this is super productive (and looking
forward to having an archived thread to point to).

Hoping D or S (and others) jump in on this too- but I can quickly speak to
a couple of the things from the Mahout side:

SystemML and Mahout (and others) recognize that 1) MLLib/SparkML have major
shortcomings as distributed ML libraries, mainly that they aren't
extensible, and 2) if you're going to 'roll your own algorithms' it would
be best to have a mathematically expressive way to do that (a programming
language that makes it easy to follow the math, similar to R).

Your two primary criteria:
Both have GPU support- on the Mahout side, a couple of talk recordings
from NVIDIA's GTC conference that explain this in more detail will be
available publicly tomorrow- we'll blast them on twitter, or just reach out
and I'll share the link.

Both scale well on Apache Spark

To your secondary criteria- I would say fairly matched- on all except:
c - quality of tools for development - advantage Mahout. This stems from
Mahout being Scala based, and therefore being able to leverage all popular
IDEs with Scala support out of the box (code completion, scala docs, etc)
in addition to using other Scala libraries- for instance pre-processing
images with scrimage or other SparkML/MLLib utilities- so the pipeline is
all one set of code.

f - quality of documentation- advantage SystemML,  we're in the middle of a
website reboot and actively seeking to close this gap, but it is a weak
spot for Mahout right now.

Additionally, I would point out that Mahout, because of its Scala-based
DSL, will integrate into other programs more easily, from a code
perspective. The counterpoint- SystemML has much better support for
exporting models as PMML, which in a microservices architecture makes
SystemML better for deploying its models (again- we have an open JIRA for
PMML support- but at the moment, SystemML wins).

Finally I would point out, Mahout is built to be engine neutral. SystemML,
by contrast, targets a known engine, which allows it to do certain
distributed optimizations because it KNOWS it will be running on Spark.
Mahout on the other hand, was built so that you can change your distributed
engine with no modification to the algorithm (only the bindings).  To write
new bindings, one must simply define the distributed structure of the
Distributed Row Matrix, and then define certain operations (like A %*% B,
and A.t %*% A) on those distributed matrices- the point being- it's much
easier than porting code to a new engine.  The key here is that if/when
Spark falls out of favor, Mahout is going to be the first on the scene with
a robust and powerful machine learning library, or- if internally you
switch engines- you'll find porting your machine learning much easier with
Mahout.  This 'feature' seems less obvious for the 'getting started' user,
but is fairly important for the user with an eye to the long game.
Succinctly- the trade-off is optimization now vs future-proofing your code.
The value of this depends a lot on everyone's personal forecast for the
fate of the Apache Spark project ;)

(This neutrality also supports interesting use cases like hybrid
Spark-batch/Flink-streaming setups.)
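
To make the bindings idea a bit more concrete, here is a rough sketch in
plain Python (not Mahout's Scala DSL). Every name in it (Bindings,
LocalBindings, PartitionedBindings, gram_projection) is made up for
illustration and is not Mahout's API. The point is only that the algorithm
at the bottom never changes; an "engine" just supplies a matrix layout plus
A %*% B and A.t %*% A.

```python
from abc import ABC, abstractmethod


def matmul(a, b):
    """In-memory product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped in-memory matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def transpose(a):
    return [list(col) for col in zip(*a)]


class Bindings(ABC):
    """Hypothetical engine contract: a matrix layout plus a few operators."""

    @abstractmethod
    def from_rows(self, rows): ...

    @abstractmethod
    def at_a(self, drm): ...      # A.t %*% A, returned as a small local matrix

    @abstractmethod
    def times(self, drm, b): ...  # A %*% B with B held locally

    @abstractmethod
    def collect(self, drm): ...   # bring the matrix back as a list of rows


class LocalBindings(Bindings):
    """'Engine' 1: the matrix is simply one list of rows."""

    def from_rows(self, rows):
        return [list(r) for r in rows]

    def at_a(self, drm):
        return matmul(transpose(drm), drm)

    def times(self, drm, b):
        return matmul(drm, b)

    def collect(self, drm):
        return drm


class PartitionedBindings(Bindings):
    """'Engine' 2: the matrix is a list of row partitions."""

    def __init__(self, n_parts):
        self.n_parts = n_parts

    def from_rows(self, rows):
        rows = [list(r) for r in rows]
        size = max(1, -(-len(rows) // self.n_parts))  # ceiling division
        return [rows[i:i + size] for i in range(0, len(rows), size)]

    def at_a(self, drm):
        # A.t %*% A is the sum of the per-partition products P.t %*% P.
        partials = [matmul(transpose(p), p) for p in drm]
        total = partials[0]
        for partial in partials[1:]:
            total = add(total, partial)
        return total

    def times(self, drm, b):
        return [matmul(p, b) for p in drm]

    def collect(self, drm):
        return [row for p in drm for row in p]


def gram_projection(engine, rows):
    """The 'algorithm': A %*% (A.t %*% A), written once for any engine."""
    drm = engine.from_rows(rows)
    gram = engine.at_a(drm)
    return engine.collect(engine.times(drm, gram))


rows = [[1, 2], [3, 4], [5, 6]]
local_result = gram_projection(LocalBindings(), rows)
partitioned_result = gram_projection(PartitionedBindings(2), rows)
```

Both calls go through the same gram_projection code and give the same
answer; only the bindings object differs.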

The original post had also asked something about Python vs. Scala.  I'm
going to take the liberty of chiming in on that too.  I think Python is an
excellent language, and worth knowing.  I think sklearn is a great ML
package for doing some first stab/playing with the data/prototyping.  I
think certain paradigms of Scala make it much better suited to working in
a distributed setting (the way you must express jobs forces your brain to
think in terms of mapping and reducing).  Even though there are claims here
and there about various Python frameworks being good for distributed ML
(pyspark, and others), none have ever really impressed me- imho manually
distributing sklearn would be a more rewarding experience than using any of
them.
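
For anyone wondering what "thinking in terms of mapping and reducing" looks
like, here is a tiny sketch in plain Python (no Spark, no Scala; the
function names are made up). Each partition is summarized on its own, and
the summaries are merged with an associative combine, so the partitions
could live on different machines. It uses the textbook sum-of-squares
formula, which is fine for illustration but not the most numerically stable
choice.

```python
from functools import reduce


def summarize(partition):
    """Map step: shrink one partition to (count, sum, sum of squares)."""
    return (len(partition), sum(partition), sum(x * x for x in partition))


def combine(a, b):
    """Reduce step: merge two summaries; order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(partitions):
    """Population mean and variance computed from per-partition summaries."""
    n, total, total_sq = reduce(combine, map(summarize, partitions),
                                (0, 0.0, 0.0))
    mean = total / n
    return mean, total_sq / n - mean * mean


# The same data split two different ways gives the same answer.
mean_a, var_a = mean_and_variance([[1, 2, 3], [4, 5], [6]])
mean_b, var_b = mean_and_variance([[1, 2, 3, 4, 5, 6]])
```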

My .02, and thanks again!

trevor

Trevor Grant
Data Scientist
https://github.com/rawkintrevo
http://stackexchange.com/users/3002022/rawkintrevo
http://trevorgrant.org

*"Fortunate is he, who is able to know the causes of things."  -Virgil*

On Tue, Jun 6, 2017 at 12:56 AM, Matthias Boehm <mboe...@googlemail.com>
wrote:

> Thanks for reaching out Gustavo. An objective discussion of how exactly
> SystemML and Mahout Samsara compare will probably help other people too. In
> order to remove bias, I'm cc'ing Dmitriy and Sebastian from the Samsara
> team, so they can correct me if needed. Scikit-learn is a great and
> very popular library of algorithms (which nicely integrates with NumPy),
> but I'm excluding it here because it does not focus on large-scale ML.
>
> Fundamentally, both SystemML and Mahout Samsara have a very different
> history and represent different points in the design space for custom
> large-scale machine learning (ML). Mahout started as a library of
> algorithms on Hadoop MapReduce and is, as an overall project, certainly
> more mature and has a larger community. Samsara itself is a more recent
> extension for custom large-scale ML on Spark and Flink. In contrast,
> SystemML was built from scratch for custom large-scale ML, originally on
> MapReduce and later Spark. After SystemML's initial open source release in
> 2015, it became a top-level Apache project just two weeks ago, and we're
> actively working on growing our community.
>
> From a technical perspective, SystemML follows a compiler approach where
> scripts with R- or Python-like syntax (but only syntax) are automatically
> compiled to hybrid runtime plans, composed of in-memory, single-node
> operations and operations on MapReduce or Spark. At script level, users
> work with matrices, frames, and scalars without specifying physical data
> properties such as dense/sparse representations, local/distributed storage,
> partitioning or caching. The major advantages are (1) the ability to easily
> write custom large-scale ML algorithms, (2) automatic adaptation to
> different data characteristics (compile distributed operations only if
> needed), and (3) simplified deployment (because the same script can be used
> for large-scale or local computations).
>
> In contrast, Samsara is a domain-specific language (DSL), embedded in the
> host language Scala. Users can either use local matrices or so-called
> Distributed Row Matrices (DRM) for distributed computation. Operations over
> local matrices are executed as is, without further optimization. In
> contrast, operations over DRMs are collected into a DAG of operations and
> lazily optimized and executed on triggering actions such as full
> aggregations, write, or explicit collect into a local matrix. Hence, the
> user is in charge of deciding between local and distributed operations,
> caching, and other data flow properties. At the same time, this lower-level
> specification allows for more control and the ability to escape to explicit
> distributed operations over rows of the DRM if needed.
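
A toy illustration of the "collect into a DAG, optimize lazily, execute on
a triggering action" flow described above, in plain Python. This is not
Samsara's implementation or API; Node, leaf, optimize and execute are
made-up names, and the only "optimization" shown is rewriting t(X) %*% X
into one fused operator.

```python
class Node:
    """A deferred operation on a (pretend) distributed matrix."""

    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

    @property
    def t(self):
        return Node("t", [self])             # only records the transpose

    def __matmul__(self, other):
        return Node("matmul", [self, other])  # only records the product

    def collect(self, log=None):
        """Triggering action: optimize the DAG, then actually run it."""
        return execute(optimize(self), [] if log is None else log)


def leaf(rows):
    return Node("leaf", value=[list(r) for r in rows])


def optimize(node, memo=None):
    """Rewrite the pattern t(X) @ X into a single fused 'ata' node."""
    memo = {} if memo is None else memo
    if id(node) in memo:
        return memo[id(node)]
    if node.op == "leaf":
        result = node
    else:
        inputs = [optimize(i, memo) for i in node.inputs]
        result = Node(node.op, inputs)
        if node.op == "matmul":
            left, right = inputs
            if left.op == "t" and left.inputs[0] is right:
                result = Node("ata", [right])
    memo[id(node)] = result
    return result


def execute(node, log):
    """Evaluate a DAG bottom-up, recording which operators really ran."""
    if node.op == "leaf":
        return node.value
    args = [execute(i, log) for i in node.inputs]
    log.append(node.op)
    if node.op == "t":
        return [list(c) for c in zip(*args[0])]
    if node.op == "matmul":
        a, b = args
        return [[sum(x * y for x, y in zip(r, c)) for c in zip(*b)]
                for r in a]
    if node.op == "ata":
        cols = list(zip(*args[0]))
        return [[sum(p * q for p, q in zip(ci, cj)) for cj in cols]
                for ci in cols]
    raise ValueError("unknown operator: " + node.op)


X = leaf([[1, 2], [3, 4]])
expr = X.t @ X              # nothing is computed here
ran = []
result = expr.collect(ran)  # optimization and execution happen here
```

After the collect, ran holds the operators that were executed; with the
rewrite in place it is the single fused operator rather than a transpose
followed by a general product.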
>
> At compiler and runtime level, there are a number of similarities but also
> major differences. For example, both systems provide different physical
> operators (for instance, for matrix multiplication), chosen depending on
> operation patterns as well as data and cluster characteristics. This
> includes local operators, operators for special patterns like t(X)%*%X,
> broadcast-based, co-partitioning, and shuffle-based operators.
> Additionally, SystemML uses a variety of simplification rewrites, a
> different distributed matrix representation of binary block matrices (w/
> various dense, sparse, and ultra-sparse formats), and fused operators in
> order to reduce scans, intermediates, and exploit sparsity across chains of
> operators. Regarding GPUs, we recently added a GPU backend for
> deep-learning and generally compute-intensive operations as an experimental
> feature in SystemML, and we're actively working on making it
> production-ready. I heard that Mahout is similarly working on GPU support
> but I am not sure about the details.
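
To illustrate what "chosen depending on data and cluster characteristics"
can mean, here is a deliberately simplistic sketch in plain Python. It is
not the optimizer of either system; the function names, the size estimate
and the thresholds are all assumptions picked for the example.

```python
def estimate_bytes(rows, cols, sparsity=1.0):
    """Rough size estimate: the cheaper of a dense and a sparse layout.

    Assumes 8 bytes per dense cell and 12 bytes per stored non-zero
    (value plus column index); both constants are illustrative only.
    """
    dense = 8 * rows * cols
    sparse = 12 * rows * cols * sparsity
    return int(min(dense, sparse))


def choose_matmul_operator(a_bytes, b_bytes, local_budget, broadcast_limit):
    """Pick a physical operator for A %*% B from size estimates alone."""
    if a_bytes + b_bytes <= local_budget:
        return "local"      # both inputs fit in one process
    if min(a_bytes, b_bytes) <= broadcast_limit:
        return "broadcast"  # ship the small side to every worker
    return "shuffle"        # repartition both sides


LOCAL_BUDGET = 10 ** 9      # hypothetical memory budget in bytes
BROADCAST_LIMIT = 10 ** 8   # hypothetical broadcast threshold in bytes

small = estimate_bytes(1000, 1000)
tall = estimate_bytes(10 ** 8, 100)
tiny = estimate_bytes(100, 100)

plan_small = choose_matmul_operator(small, small, LOCAL_BUDGET, BROADCAST_LIMIT)
plan_mixed = choose_matmul_operator(tall, tiny, LOCAL_BUDGET, BROADCAST_LIMIT)
plan_large = choose_matmul_operator(tall, tall, LOCAL_BUDGET, BROADCAST_LIMIT)
```

A real optimizer would also weigh operation patterns, partitioning and
cluster configuration; the sketch only shows the shape of the decision.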
>
> To summarize, both SystemML and Samsara aim at different abstraction
> levels, and differ substantially in their compiler and runtime internals.
> Of course, there are also shared goals and motivations (such as simplifying
> custom, large-scale ML), but competition is good as it drives improvements.
> I hope this gives a high-level comparison. If you have additional specific
> questions, feel free to ask.
>
> Regards,
> Matthias
>
>
> On Mon, Jun 5, 2017 at 6:56 PM, Gustavo Frederico <
> gustavo.freder...@thinkwrap.com> wrote:
>
> > Greetings,
> >
> > I worked with the theory of SVMs during my Graduate studies and I’m
> > relatively new to existing ML software. Assuming that I want to create
> > new scalable ML algorithms starting with the Math, the question is: how
> > do scikit-learn, Mahout Samsara and SystemML compare to each other?
> >
> > I see interesting Python-based frameworks such as scikit-learn, but then
> > I read SystemML's article on Wikipedia that made me question the
> > distributive scalability of ("pure") Python for large amounts of data:
> >
> > "[...] It was observed that data scientists would write machine learning
> > algorithms in languages such as R and Python for small data. When it came
> > time to scale to big data, a systems programmer would be needed to scale
> > the algorithm in a language such as Scala. This process typically
> > involved days or weeks per iteration, and errors would occur translating
> > the algorithms to operate on big data." (
> > https://en.wikipedia.org/wiki/Apache_SystemML )
> >
> > And the article starts stating that Apache SystemML has "algorithm
> > customizability via [...] Python-like languages".
> >
> > Mahout Samsara is based on Scala. PredictionIO
> > (predictionio.incubator.apache.org) algorithms are based on Mahout
> > Samsara and Scala.  I asked Mr. Matthias Boehm at a conference how one
> > could compare Mahout Samsara to SystemML. From what I understood, Samsara
> > needs "explicit declarations" in expressions for distributed computing,
> > while SystemML doesn’t; please correct me if I’m wrong. Also, SystemML
> > will optimize the entire script, while Samsara will optimize expressions;
> > again, please correct me if I’m wrong.
> >
> > While my main criterion is scalability (cluster, GPU support etc), other
> > criteria to evaluate these frameworks may be: a) public adoption, b)
> > active dev community, c) quality of tools for development, d) backing of
> > big companies, e) simplicity working with clusters (delegating the
> > complexities of clustering to the framework, "hiding" them from the
> > user), f) quality of documentation, g) quality of the software itself.
> >
> > ( My question was deleted from stats.stackexchange.com for being
> > off-topic and deleted from Stack Overflow for being bound to get answers
> > with "opinions rather than facts" [sic]. I’m very much interested in
> > hearing balanced and insightful comments from the list. )
> >
> > Thank you,
> >
> > Gustavo
>
