BTW, I should also add that it might make sense for Mahout to call MLlib internally. I just haven't looked into it enough to decide whether we'd want to provide more than input/output wrappers. But it would certainly be great to have Mahout people help out with MLlib in some way.

Matei
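For what it's worth, a rough sketch of what one such input wrapper could look like on the Spark side is below: reading a Mahout-style SequenceFile of (id, VectorWritable) pairs into an RDD of dense arrays. This is an illustration only - the function name is made up, and the package names depend on the Spark version in use - not an existing API.

    import org.apache.hadoop.io.IntWritable
    import org.apache.mahout.math.VectorWritable
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical input wrapper: Mahout SequenceFile of (row id, vector)
    // pairs read into an RDD of plain (Int, Array[Double]) records.
    def readMahoutVectors(sc: SparkContext, path: String): RDD[(Int, Array[Double])] =
      sc.sequenceFile(path, classOf[IntWritable], classOf[VectorWritable])
        .map { case (id, vw) =>
          val v = vw.get()
          // copy out of the (reused) Writable into a fresh dense array
          (id.get, Array.tabulate(v.size())(i => v.get(i)))
        }

An output wrapper would just be the reverse: map back to writables and save as a SequenceFile.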
On Jul 24, 2013, at 11:39 AM, Matei Zaharia <[email protected]> wrote:

> Hey Nick,
>
> Thanks for your interest in this stuff! I'm going to let the MLbase team answer this in more detail, but just some quick answers on the first part of your email:
>
> - From my point of view, the ML library in Spark is meant to be just a library of "kernel" functions you can call, not a complete ETL and data format system like Mahout. The goal was to have good implementations of common algorithms that different higher-level systems (e.g. MLbase, Shark, PySpark) can call into.
>
> - We wanted to try keeping this in Spark initially to make it a kind of "standard library". This is something that can help ensure it becomes high-quality over time and keep it supported by the project. If you think about it, projects like R and Matlab are very strong primarily because they have great standard libraries. This was also one of the things we thought would differentiate us from Hadoop and Mahout. However, we will of course see how things go and separate it out if it needs a faster dev cycle.
>
> - I haven't worried much about compatibility with Mahout because I'm not sure Mahout is too widely used and I'm not sure its abstractions are best. Mahout is very tied to HDFS, SequenceFiles, etc. We will of course try to interoperate well with data from Mahout, but at least as far as I was concerned, I wanted an API that makes sense for Spark users.
>
> - Something that's maybe not clear about the MLlib API is that we also want it to be used easily from Java and Python. So we've explicitly avoided having very high-level types or using Scala-specific features, in order to get something that will be simple to call from these languages. This does leave room for wrappers that provide higher-level interfaces.
>
> In any case, if you like this "kernel" design for MLlib, it would be great to get more people contributing to it, or to get it used in other projects. I'll let the MLbase folks talk about higher-level interfaces -- this is definitely something they want to do, but they might be able to use help. In any case though, sharing the low-level kernels across Spark projects would make a lot of sense.
>
> Matei
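To make the "kernel" idea above concrete, here is a toy sketch in that style: a plain function over an RDD of simple types, returning a simple result, with nothing Scala-specific in the signature so that Java and Python wrappers stay thin. The object and method names are made up for illustration; this is not MLlib code.

    import org.apache.spark.rdd.RDD

    // Toy "kernel"-style function: simple types in (RDD[Array[Double]]),
    // a simple result out (Array[Double]); no Scala-specific types exposed.
    object ColumnMeans {
      def colMeans(data: RDD[Array[Double]]): Array[Double] = {
        val (sums, count) = data
          .map(row => (row, 1L))
          .reduce { case ((a, na), (b, nb)) =>
            (a.zip(b).map { case (x, y) => x + y }, na + nb)
          }
        sums.map(_ / count)
      }
    }

A higher-level system would then just call ColumnMeans.colMeans(rdd), and a Java or Python wrapper only has to marshal arrays rather than richer Scala types.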
> On Jul 24, 2013, at 1:46 AM, Nick Pentreath <[email protected]> wrote:
>
>> Hi dev team
>>
>> (Apologies for a long email!)
>>
>> Firstly, great news about the inclusion of MLlib into the Spark project!
>>
>> I've been working on a concept and some code for a machine learning library on Spark, and so of course there is a lot of overlap between MLlib and what I've been doing.
>>
>> I wanted to throw this out there and (a) ask a couple of design and roadmap questions about MLlib, and (b) talk about how to work together / integrate my ideas (if at all :)
>>
>> *Some questions*
>>
>> 1. What is the general design idea behind MLlib - is it aimed at being a collection of algorithms, i.e. a library? Or is it aimed at being a "Mahout for Spark", i.e. something that can be used as a library as well as a set of tools for things like running jobs, feature extraction, text processing etc.?
>> 2. How married are we to keeping it within the Spark project? While I understand the reasoning behind it, I am not convinced it's best. But I guess we can wait and see how it develops.
>> 3. Some of the original test code I saw around the Block ALS did use Breeze (https://github.com/dlwh/breeze) for some of the linear algebra. Now I see everything is using JBLAS directly and Array[Double]. Is there a specific reason for this? Is it aimed at creating a separation whereby the linear algebra backend could be switched out? Scala 2.10 issues? (See the short sketch after these questions for the two styles side by side.)
>> 4. Since Spark is meant to be nicely compatible with Hadoop, do we care about compatibility/integration with Mahout? This may also encourage Mahout developers to switch over and contribute their expertise (see for example Dmitry's work at https://github.com/dlyubimov/mahout-commits/commits/dev-0.8.x-scala/math-scala, where he is building a Scala/Spark DSL around mahout-math matrices and distributed operations). Potentially even using mahout-math for linear algebra routines?
>> 5. Is there a roadmap? (I've checked the JIRA, which does have a few intended models etc.) Who are the devs most involved in this project?
>> 6. What are the thoughts around API design for models?
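As a purely illustrative aside on question 3, here is the same weight update written in the two styles mentioned there (a minimal sketch; it assumes breeze-math and jblas on the classpath, and the Array[Double] convention described above):

    import breeze.linalg.DenseVector
    import org.jblas.DoubleMatrix

    // Breeze DSL style: the update reads as a vector expression.
    def updateBreeze(w: DenseVector[Double], grad: DenseVector[Double], step: Double): DenseVector[Double] =
      w - (grad * step)

    // JBLAS + Array[Double] style: wrap the raw arrays, compute, unwrap.
    def updateJblas(w: Array[Double], grad: Array[Double], step: Double): Array[Double] =
      new DoubleMatrix(w).sub(new DoubleMatrix(grad).mul(step)).data

The arithmetic is the same either way; the trade-off is the readability of the DSL versus the plainer types that are easier to expose to Java and Python.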
>> *Some thoughts*
>>
>> So, over the past couple of months I have been working on a machine learning library. Initially it was for my own use, but I've added a few things and was starting to think about releasing it (though it's not nearly ready). The model that I really needed first was ALS for doing recommendations, so I have ported the ALS code from Mahout to Spark. Well, "ported" in some sense - mostly I copied the algorithm and data distribution design, using Spark's primitives and Breeze for all the linear algebra.
>>
>> I found it pretty straightforward to port over. So far I have done local testing only, on the Movielens datasets, and I have found my RMSE results to match those of Mahout. Interestingly, the overall wall-clock performance is not as different as I would have expected, but I would now like to do some larger-scale tests on a cluster to make a proper comparison.
>>
>> Obviously, with Spark's Block ALS model, my version is now somewhat superfluous, since I expect (and have so far seen in my simple local experiments) that the block model will significantly outperform it. I will probably port my use case over to it in due time once I've done further testing.
>>
>> I also found Breeze very nice to work with and I like the DSL - hence my question about why not use that (especially now that Breeze is actually just breeze-math and breeze-viz)?
>>
>> Anyway, I then added KMeans (basically just the Spark example with some Breeze tweaks) and started working on a linear model framework. I've also added a simple framework for arg parsing and config (using Twitter Algebird's Args and Typesafe Config), and have started on feature extraction - of particular interest will be text feature extraction and feature hashing.
>>
>> This is roughly the idea for a machine learning library on Spark that I have - call it a design or manifesto or whatever:
>>
>> - Library available and consistent across Scala, Java and Python (as much as possible, in any event)
>> - A core library, plus a set of tools for easily running models based on standard input formats etc.
>> - Standardised model API (even across languages) to the extent possible. I've based mine so far on Python's scikit-learn (.fit(), .predict() etc.). Why? Because I believe a major strength of scikit-learn is that its API is so clean, simple and consistent. Plus, for the Python version of the lib, scikit-learn will no doubt be used wherever possible to avoid re-creating code.
>> - Models to be included initially:
>>   - ALS
>>   - Possibly co-occurrence recommendation stuff similar to Mahout's Taste
>>   - Clustering (K-Means and others potentially)
>>   - Linear models - the idea here is to have something very close to Vowpal Wabbit, i.e. a generic SGD engine with various loss functions, learning-rate paradigms etc. Furthermore, this would allow other models similar to VW, such as online versions of matrix factorisation, neural nets and learning reductions.
>>   - Possibly decision trees / random forests
>> - Some utilities for feature extraction (hashing in particular), and for making jobs easy to run (integration with Spark's ./run etc.?)
>> - Stuff for making pipelining easy (like scikit-learn) and for doing things like cross-validation in a principled (and parallel) way
>> - Clean and easy integration with Spark Streaming for online models (e.g. a linear SGD model can be trained with fit() on batch data, and then fit() and/or predict() can be called on streaming data to keep learning online)
>> - Interactivity provided by shells (IPython, Spark shell) and plotting capability (Matplotlib, breeze-viz)
>> - For Scala, integration with Shark via sql2rdd etc.
>> - Something similar to Scalding's Matrix API, based on RDDs, for representing distributed matrices, as well as integrating the ideas of Dmitry and Mahout's DistributedRowMatrix etc.
>>
>> Here is a rough outline of the model API I have used at the moment: https://gist.github.com/MLnick/6068841. This works nicely for ALS, clustering, linear models etc.
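To illustrate the kind of shape such an API takes, here is a minimal sketch of the fit()/predict() idea. The trait and class names are made up for illustration and are not taken from the gist above.

    import org.apache.spark.rdd.RDD

    // A fitted model knows how to score a single feature vector.
    trait Model {
      def predict(features: Array[Double]): Double
    }

    // An estimator trains on (features, label) pairs and returns a fitted Model.
    trait Estimator {
      def fit(data: RDD[(Array[Double], Double)]): Model
    }

    // Toy example: an estimator whose model always predicts the mean training label.
    class MeanEstimator extends Estimator {
      def fit(data: RDD[(Array[Double], Double)]): Model = {
        val (sum, n) = data
          .map(p => (p._2, 1L))
          .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
        val mean = sum / n
        new Model { def predict(features: Array[Double]) = mean }
      }
    }

A batch prediction is then just data.map(model.predict), and the same fit/predict shape maps naturally onto Java and Python wrappers.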
>> So as you can see, this mostly overlaps with what MLlib already has or has planned in some way. But my main aim, frankly, is to have consistency in the API and some level of abstraction while keeping things as simple as possible (i.e. let Spark handle the complex stuff), and thus hopefully to avoid things becoming a somewhat haphazard collection of models that is hard to figure out how to use - which is unfortunately what I believe has happened to Mahout.
>>
>> So the question then is, how do we work together or integrate? I see three options:
>>
>> 1. I go my own way (not very appealing, obviously).
>> 2. I contribute what I have (or as much of it as makes sense) to MLlib.
>> 3. I create my project as a "front-end" or "wrapper" around MLlib as the core, effectively providing the API and workflow interface but using MLlib as the model engine.
>>
>> #2 is appealing, but a lot then depends on the API and framework design, and on how compatible what I have in mind is with what the rest of the devs want.
>>
>> #3, now that I have written it down, starts to sound pretty interesting - potentially we're looking at a "front-end" that could execute models on Spark (or on other engines like Hadoop/Mahout, GraphX etc.), while providing workflows for pipelining transformations, feature extraction, testing and cross-validation, and data viz.
>>
>> But of course #3 starts sounding somewhat like what MLbase is aiming to be (I think)!
>>
>> At this point I'm willing to share what I have done so far on a selective basis, if that makes sense - be warned, though, that it is rough, unfinished and perhaps somewhat clunky, as it's my first attempt at a library/framework. Especially since the main thing I did was the ALS port, and with MLlib's version of ALS that may be less useful now in any case.
>>
>> It may be that none of this is that useful to others anyway, which is fine - I'll keep developing the tools that I need, and potentially they will be useful at some point.
>>
>> Thoughts, feedback, comments, discussion? I really want to jump into MLlib and get involved in contributing to standardised machine learning on Spark!
>>
>> Nick
