BTW, I should also add that it might make sense for Mahout to call MLlib internally. I just haven't looked into it enough to decide whether we'd want to provide more than input/output wrappers. But it would certainly be great to have Mahout people help out with MLlib in some way.

Matei
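For what it's worth, a rough sketch of what one such input wrapper could look like on the Spark side is below: reading a Mahout-style SequenceFile of (id, VectorWritable) pairs into an RDD of dense arrays. This is an illustration only - the function name is made up, and the package names depend on the Spark version in use - not an existing API.

    import org.apache.hadoop.io.IntWritable
    import org.apache.mahout.math.VectorWritable
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical input wrapper: Mahout SequenceFile of (row id, vector)
    // pairs read into an RDD of plain (Int, Array[Double]) records.
    def readMahoutVectors(sc: SparkContext, path: String): RDD[(Int, Array[Double])] =
      sc.sequenceFile(path, classOf[IntWritable], classOf[VectorWritable])
        .map { case (id, vw) =>
          val v = vw.get()
          // copy out of the (reused) Writable into a fresh dense array
          (id.get, Array.tabulate(v.size())(i => v.get(i)))
        }

An output wrapper would just be the reverse: map back to writables and save as a SequenceFile.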
On Jul 24, 2013, at 11:39 AM, Matei Zaharia <[email protected]> wrote:

> Hey Nick,
>
> Thanks for your interest in this stuff! I'm going to let the MLbase team answer this in more detail, but just some quick answers on the first part of your email:
>
> - From my point of view, the ML library in Spark is meant to be just a library of "kernel" functions you can call, not a complete ETL and data format system like Mahout. The goal was to have good implementations of common algorithms that different higher-level systems (e.g. MLbase, Shark, PySpark) can call into.
>
> - We wanted to try keeping this in Spark initially to make it a kind of "standard library". This is something that can help ensure it becomes high-quality over time and keep it supported by the project. If you think about it, projects like R and Matlab are very strong primarily because they have great standard libraries. This was also one of the things we thought would differentiate us from Hadoop and Mahout. However, we will of course see how things go and separate it out if it needs a faster dev cycle.
>
> - I haven't worried much about compatibility with Mahout because I'm not sure Mahout is too widely used and I'm not sure its abstractions are best. Mahout is very tied to HDFS, SequenceFiles, etc. We will of course try to interoperate well with data from Mahout, but at least as far as I was concerned, I wanted an API that makes sense for Spark users.
>
> - Something that's maybe not clear about the MLlib API is that we also want it to be used easily from Java and Python. So we've explicitly avoided having very high-level types or using Scala-specific features, in order to get something that will be simple to call from these languages. This does leave room for wrappers that provide higher-level interfaces.
>
> In any case, if you like this "kernel" design for MLlib, it would be great to get more people contributing to it, or to get it used in other projects. I'll let the MLbase folks talk about higher-level interfaces -- this is definitely something they want to do, but they might be able to use help. In any case though, sharing the low-level kernels across Spark projects would make a lot of sense.
>
> Matei
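To make the "kernel" idea above concrete, here is a toy sketch in that style: a plain function over an RDD of simple types, returning a simple result, with nothing Scala-specific in the signature so that Java and Python wrappers stay thin. The object and method names are made up for illustration; this is not MLlib code.

    import org.apache.spark.rdd.RDD

    // Toy "kernel"-style function: simple types in (RDD[Array[Double]]),
    // a simple result out (Array[Double]); no Scala-specific types exposed.
    object ColumnMeans {
      def colMeans(data: RDD[Array[Double]]): Array[Double] = {
        val (sums, count) = data
          .map(row => (row, 1L))
          .reduce { case ((a, na), (b, nb)) =>
            (a.zip(b).map { case (x, y) => x + y }, na + nb)
          }
        sums.map(_ / count)
      }
    }

A higher-level system would then just call ColumnMeans.colMeans(rdd), and a Java or Python wrapper only has to marshal arrays rather than richer Scala types.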
> On Jul 24, 2013, at 1:46 AM, Nick Pentreath <[email protected]> wrote:
>
>> Hi dev team
>>
>> (Apologies for a long email!)
>>
>> Firstly, great news about the inclusion of MLlib into the Spark project!
>>
>> I've been working on a concept and some code for a machine learning library on Spark, and so of course there is a lot of overlap between MLlib and what I've been doing.
>>
>> I wanted to throw this out there and (a) ask a couple of design and roadmap questions about MLlib, and (b) talk about how to work together / integrate my ideas (if at all :)
>>
>> *Some questions*
>>
>> 1. What is the general design idea behind MLlib - is it aimed at being a collection of algorithms, i.e. a library? Or is it aimed at being a "Mahout for Spark", i.e. something that can be used as a library as well as a set of tools for things like running jobs, feature extraction, text processing etc.?
>> 2. How married are we to keeping it within the Spark project? While I understand the reasoning behind it, I am not convinced it's best. But I guess we can wait and see how it develops.
>> 3. Some of the original test code I saw around the Block ALS did use Breeze (https://github.com/dlwh/breeze) for some of the linear algebra. Now I see everything is using JBLAS directly and Array[Double]. Is there a specific reason for this? Is it aimed at creating a separation whereby the linear algebra backend could be switched out? Scala 2.10 issues? (See the short sketch after these questions for the two styles side by side.)
>> 4. Since Spark is meant to be nicely compatible with Hadoop, do we care about compatibility/integration with Mahout? This may also encourage Mahout developers to switch over and contribute their expertise (see for example Dmitry's work at https://github.com/dlyubimov/mahout-commits/commits/dev-0.8.x-scala/math-scala, where he is building a Scala/Spark DSL around mahout-math matrices and distributed operations). Potentially even using mahout-math for linear algebra routines?
>> 5. Is there a roadmap? (I've checked the JIRA, which does have a few intended models etc.) Who are the devs most involved in this project?
>> 6. What are the thoughts around API design for models?
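As a purely illustrative aside on question 3, here is the same weight update written in the two styles mentioned there (a minimal sketch; it assumes breeze-math and jblas on the classpath, and the Array[Double] convention described above):

    import breeze.linalg.DenseVector
    import org.jblas.DoubleMatrix

    // Breeze DSL style: the update reads as a vector expression.
    def updateBreeze(w: DenseVector[Double], grad: DenseVector[Double], step: Double): DenseVector[Double] =
      w - (grad * step)

    // JBLAS + Array[Double] style: wrap the raw arrays, compute, unwrap.
    def updateJblas(w: Array[Double], grad: Array[Double], step: Double): Array[Double] =
      new DoubleMatrix(w).sub(new DoubleMatrix(grad).mul(step)).data

The arithmetic is the same either way; the trade-off is the readability of the DSL versus the plainer types that are easier to expose to Java and Python.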
>> *Some thoughts*
>>
>> So, over the past couple of months I have been working on a machine learning library. Initially it was for my own use, but I've added a few things and was starting to think about releasing it (though it's not nearly ready). The model that I really needed first was ALS for doing recommendations, so I have ported the ALS code from Mahout to Spark. Well, "ported" in some sense - mostly I copied the algorithm and data distribution design, using Spark's primitives and Breeze for all the linear algebra.
>>
>> I found it pretty straightforward to port over. So far I have done local testing only, on the Movielens datasets, and I have found my RMSE results to match those of Mahout. Interestingly, the overall wall-clock performance is not as different as I would have expected, but I would now like to do some larger-scale tests on a cluster to make a proper comparison.
>>
>> Obviously, with Spark's Block ALS model, my version is now somewhat superfluous, since I expect (and have so far seen in my simple local experiments) that the block model will significantly outperform it. I will probably port my use case over to it in due time once I've done further testing.
>>
>> I also found Breeze very nice to work with and I like the DSL - hence my question about why not use that (especially now that Breeze is actually just breeze-math and breeze-viz)?
>>
>> Anyway, I then added KMeans (basically just the Spark example with some Breeze tweaks) and started working on a linear model framework. I've also added a simple framework for arg parsing and config (using Twitter Algebird's Args and Typesafe Config), and have started on feature extraction - of particular interest will be text feature extraction and feature hashing.
>>
>> This is roughly the idea for a machine learning library on Spark that I have - call it a design or manifesto or whatever:
>>
>> - Library available and consistent across Scala, Java and Python (as much as possible, in any event)
>> - A core library, plus a set of tools for easily running models based on standard input formats etc.
>> - Standardised model API (even across languages) to the extent possible. I've based mine so far on Python's scikit-learn (.fit(), .predict() etc.). Why? Because I believe a major strength of scikit-learn is that its API is so clean, simple and consistent. Plus, for the Python version of the lib, scikit-learn will no doubt be used wherever possible to avoid re-creating code.
>> - Models to be included initially:
>>   - ALS
>>   - Possibly co-occurrence recommendation stuff similar to Mahout's Taste
>>   - Clustering (K-Means and others potentially)
>>   - Linear models - the idea here is to have something very close to Vowpal Wabbit, i.e. a generic SGD engine with various loss functions, learning-rate paradigms etc. Furthermore, this would allow other models similar to VW, such as online versions of matrix factorisation, neural nets and learning reductions.
>>   - Possibly decision trees / random forests
>> - Some utilities for feature extraction (hashing in particular), and for making jobs easy to run (integration with Spark's ./run etc.?)
>> - Stuff for making pipelining easy (like scikit-learn) and for doing things like cross-validation in a principled (and parallel) way
>> - Clean and easy integration with Spark Streaming for online models (e.g. a linear SGD model can be trained with fit() on batch data, and then fit() and/or predict() can be called on streaming data to keep learning online)
>> - Interactivity provided by shells (IPython, Spark shell) and plotting capability (Matplotlib, breeze-viz)
>> - For Scala, integration with Shark via sql2rdd etc.
>> - Something similar to Scalding's Matrix API, based on RDDs, for representing distributed matrices, as well as integrating the ideas of Dmitry and Mahout's DistributedRowMatrix etc.
>>
>> Here is a rough outline of the model API I have used at the moment: https://gist.github.com/MLnick/6068841. This works nicely for ALS, clustering, linear models etc.
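To illustrate the kind of shape such an API takes, here is a minimal sketch of the fit()/predict() idea. The trait and class names are made up for illustration and are not taken from the gist above.

    import org.apache.spark.rdd.RDD

    // A fitted model knows how to score a single feature vector.
    trait Model {
      def predict(features: Array[Double]): Double
    }

    // An estimator trains on (features, label) pairs and returns a fitted Model.
    trait Estimator {
      def fit(data: RDD[(Array[Double], Double)]): Model
    }

    // Toy example: an estimator whose model always predicts the mean training label.
    class MeanEstimator extends Estimator {
      def fit(data: RDD[(Array[Double], Double)]): Model = {
        val (sum, n) = data
          .map(p => (p._2, 1L))
          .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
        val mean = sum / n
        new Model { def predict(features: Array[Double]) = mean }
      }
    }

A batch prediction is then just data.map(model.predict), and the same fit/predict shape maps naturally onto Java and Python wrappers.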
>> So as you can see, this mostly overlaps with what MLlib already has or has planned in some way. But my main aim, frankly, is to have consistency in the API and some level of abstraction while keeping things as simple as possible (i.e. let Spark handle the complex stuff), and thus hopefully to avoid things becoming a somewhat haphazard collection of models that is hard to figure out how to use - which is unfortunately what I believe has happened to Mahout.
>>
>> So the question then is, how do we work together or integrate? I see three options:
>>
>> 1. I go my own way (not very appealing, obviously).
>> 2. I contribute what I have (or as much of it as makes sense) to MLlib.
>> 3. I create my project as a "front-end" or "wrapper" around MLlib as the core, effectively providing the API and workflow interface but using MLlib as the model engine.
>>
>> #2 is appealing, but a lot then depends on the API and framework design, and on how compatible what I have in mind is with what the rest of the devs want.
>>
>> #3, now that I have written it down, starts to sound pretty interesting - potentially we're looking at a "front-end" that could execute models on Spark (or on other engines like Hadoop/Mahout, GraphX etc.), while providing workflows for pipelining transformations, feature extraction, testing and cross-validation, and data viz.
>>
>> But of course #3 starts sounding somewhat like what MLbase is aiming to be (I think)!
>>
>> At this point I'm willing to share what I have done so far on a selective basis, if that makes sense - be warned, though, that it is rough, unfinished and perhaps somewhat clunky, as it's my first attempt at a library/framework. Especially since the main thing I did was the ALS port, and with MLlib's version of ALS that may be less useful now in any case.
>>
>> It may be that none of this is that useful to others anyway, which is fine - I'll keep developing the tools that I need, and potentially they will be useful at some point.
>>
>> Thoughts, feedback, comments, discussion? I really want to jump into MLlib and get involved in contributing to standardised machine learning on Spark!
>>
>> Nick
