I for one have been using Spark extensively for the past few months - admittedly not in full production, mostly for testing and prototyping - and love it. Over the past two releases the project has come a long way, adding a near-complete Python API and Spark Streaming (comparable to Storm - well, more like the Trident abstraction actually).
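(As a point of reference for the Trident comparison: a minimal sketch of the
Spark Streaming micro-batch model, written against the post-1.x Scala API -
the 0.7-era StreamingContext constructor differed slightly, and the socket
source and port here are just placeholders.)

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-word-count")
        // One-second micro-batches: the stream is discretized into a sequence
        // of small RDDs, which is why it sits closer to Trident than to raw
        // tuple-at-a-time Storm.
        val ssc = new StreamingContext(conf, Seconds(1))
        val counts = ssc.socketTextStream("localhost", 9999) // placeholder source
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }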
There are definitely plenty of rough edges still (the Spark team needs your help to find them and smooth them out!) - especially around tooling vis-à-vis the Hadoop distros like CDH / MapR / Hortonworks. I've not tested a full YARN or Mesos deployment, but I have tested the standalone deploy on a CDH4 cluster and it works very well. There's also a bootstrap action for Spark/Shark on Elastic MapReduce (http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923), so you can run them together on AWS (which opens up interesting possibilities for use within AWS Data Pipeline).

Anyway, I've decided to bet the farm on Spark for my new project (nothing to see here yet, move along) - for whatever that's worth :). The main point of interest in this context is that I intend to build a minimal first-cut machine learning library for Spark. This is likely to involve porting / using parts of Mahout where it makes sense (or at the very least taking major cues from Mahout implementations, as well as from other ML libraries). On the Java side it would also very probably use mahout-math (although there are some other options). I hope to start getting into this properly over the next couple of months. So if anyone is interested in collaborating, I would very much welcome help, discussion, idea-bouncing, etc.

Nick

On Wed, Mar 13, 2013 at 4:04 AM, Dmitriy Lyubimov <[email protected]> wrote:

> Hi Josh,
>
> On Tue, Mar 12, 2013 at 3:05 PM, Josh Wills <[email protected]> wrote:
>
> > Hey Mahout Devs,
> >
> > First, I wanted to say that I think there are lots of problems that can
> > be handled well in MapReduce (the recent streaming k-means work being a
> > prime example), even if they could be performed even faster using an
> > in-memory model.
>
> Yeah. The sticking point for me is PageRank-ish finding of stationary
> distributions, and the extremely popular ALS-based stuff. We beat the heck
> out of this a year ago, and Sebastian conclusively stated that Giraph ALS
> knocks the socks off the MR version. Add to that a bisect search for a
> good regularization value, and the number of iterations needed multiplies
> (roughly the kind of loop sketched after this excerpt) -- that's one thing
> we eventually dropped completely, as far as I can remember, for that very
> reason. A parallel search is also possible, but it would require far more
> crunching than the bisect one.
>
> Even with Shark I am able to do ETL-like things in under 5s on a dataset
> where the same Hive query takes minutes to complete. There's absolutely no
> question that for some of our exploratory requirements MR just doesn't cut
> it - especially for problems with a high number of steps, a well-defined
> blocking structure, and heavy interconnectedness.
>
> Also: batch stuff is 95% ETL / variable prep. There's absolutely no
> argument that batch systems such as Crunch and Cascading and Pig will stay
> absolutely valuable in that problem space.
>
> If we throw away those 95%, however, and keep the relatively condensed ML
> problems, then at least 50% of real-life problems land at medium-sized,
> CPU-bound datasets with high interconnectedness and a high number of
> supersteps. Just as Ted mentioned, I have found this to be increasingly
> the case in practice. I had to eat a lot of dung with SSVD in order to
> manage the interconnectedness problem and fit it into small information
> pieces so that the side-load is not apparent at scale, but even that
> approach still has it, and it will undoubtedly run more efficiently once
> combined with scatter/distributed-memory techniques.
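To make the iteration-count point concrete, here is a rough sketch of the
bisect-style regularization search Dmitriy describes. trainAndScore is
hypothetical - it stands in for a complete (itself iterative) ALS run scored
on held-out data - and the loop is really a ternary search, since locating
the minimum of a unimodal validation curve needs two interior probes per
step:

    // `trainAndScore` is a hypothetical stand-in: a full ALS factorization
    // for a given regularization lambda, returning held-out error. Each call
    // is itself an iterative job, so every probe multiplies the number of
    // passes over the data.
    def searchLambda(lo: Double, hi: Double, tol: Double)
                    (trainAndScore: Double => Double): Double = {
      var (a, b) = (lo, hi)
      while (b - a > tol) {
        val m1 = a + (b - a) / 3.0
        val m2 = b - (b - a) / 3.0
        // Keep the sub-interval that must contain the minimum of the
        // unimodal validation curve.
        if (trainAndScore(m1) < trainAndScore(m2)) b = m2 else a = m1
      }
      (a + b) / 2.0
    }

Each probe may itself be 10-20 ALS sweeps, so even a handful of search steps
multiplies into hundreds of passes over the ratings - cheap against cached
in-memory partitions, brutal as a chain of MapReduce jobs.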
> I found some operational problems with Spark (notably, task cleanup in
> workers), although it is very likely just me somehow being stupid or
> running a lot of code that breaks. But I surprisingly found myself
> swearing a lot less than I did about some of the Hadoop-ecosystem products
> operationally. I have still not had a production trial, though, to be
> honest. We'll see.
>
> > I'm wondering if we could solve both problems by creating a wrapper API
> > that would look a lot like the Spark RDD API and then creating
> > implementations of that API via Spark/Giraph/Crunch (truly shameless
> > promotion: http://crunch.apache.org/ ). That way, the same model could
> > be run in-memory against Spark or in batch via a series of MapReduce
> > jobs (or a BSP job, or a Tez pipeline, or whatever execution framework
> > is written next week.)

(A minimal sketch of what such a wrapper might look like follows at the end
of this message.)

> I considered similar things before. One problem with this approach, the
> way I see it, is that you usually have to create some sort of format
> bridging, or at the very least persist the results, when changing gears
> between approaches.
>
> One of the nicest things about Spark is that datasets are already
> partitioned in memory, not going anywhere, and ready to launch either a
> superstep or a mapper the next nanosecond without any additional ado. They
> are giving it all in the same giftwrap. And yes, there's this scatter
> thing vs. the forced-sorting thing, but oh well. And yes, they are just as
> operationally compatible with Hadoop stuff as regular MR - same HDFS
> location affinity. I just recently finished a parallel RDD pulling data
> simultaneously from an HBase coprocessor-based spatial scan, with the
> whole region-server location affinity thing - very easy to integrate.
>
> (Well, easy to integrate on the Spark side; the endpoint coprocessor API
> on pre-0.96 HBase is woefully lacking in streaming, so... but even
> then...)
>
> > The main virtue of Crunch in this regard is that the data model is very
> > similar to Spark's (truth be told, I used the Spark API as a reference
> > when I was originally creating Crunch's Scala API, "Scrunch") -- the
> > whole idea of distributed collections of "things," where things can be
> > anything you want them to be (e.g., Mahout Vectors).
>
> My opinion is that the Crunch API is no better or worse than the RDD API,
> but it lacks bulk parallel operations (as it stands today, obviously). And
> an argument about subset similarity is far from an argument about
> superiority. FWIW I believe the Crunch API is far better than Cascading
> for integrating with Mahout's vectors (and better suited for custom type
> integration, such as R types), which is why I was, and still am,
> interested in that direction; but Spark adoption is at the very least just
> as easy.
>
> > I don't have an opinion on the structure of such an effort (via the
> > incubator or otherwise), but I thought I would throw the idea out there,
> > as it's something I would definitely like to be involved in.
>
> I support extending Mahout to any end (on the side first, with
> module-level isolation). In the end, everything that is added either wins
> by being used or stagnates; but without fresh attempts it is hard to
> expect any results at all. The sheer fact that this discussion exists is a
> testament that Mahout may need to try something new operationally to
> maintain and extend its acceptance. But I think the agreement has always
> been to maintain high entry criteria for the new stuff, with good usage
> prospects.
> Just adding stuff for the sake of adding stuff to the mainstream has
> tended to dissolve the project's cohesiveness in the past.
>
> -d
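For concreteness, here is a minimal sketch of the shape the wrapper API Josh
proposes might take. All names here (DCollection, SparkDCollection) are
illustrative, not an actual Crunch or Spark API - the point is only that one
logical collection abstraction can hide several execution backends:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // One logical API, many backends: algorithms are written against this
    // trait and stay agnostic of the execution engine underneath.
    trait DCollection[T] {
      def map[U: ClassTag](f: T => U): DCollection[U]
      def filter(p: T => Boolean): DCollection[T]
      def reduce(f: (T, T) => T): T
    }

    // In-memory backend delegating straight to Spark's RDD.
    class SparkDCollection[T](rdd: RDD[T]) extends DCollection[T] {
      def map[U: ClassTag](f: T => U): DCollection[U] =
        new SparkDCollection(rdd.map(f))
      def filter(p: T => Boolean): DCollection[T] =
        new SparkDCollection(rdd.filter(p))
      def reduce(f: (T, T) => T): T = rdd.reduce(f)
    }

A Crunch backend would implement the same trait over PCollection<T>, and a
Giraph or Tez one over its own primitives, so an algorithm written once
against a DCollection of, say, Mahout Vectors could run in memory or as a
chain of MapReduce jobs unchanged - which is the portability Josh is after.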
