On Mar 13, 2013 1:02 AM, "Nick Pentreath" <[email protected]> wrote:
>

> There are definitely plenty of rough edges still (the Spark team needs your
> help to find them and smooth them out!) - especially around tooling vis-à-vis
> the Hadoop distros like CDH / MapR / Hortonworks. I've not tested a full YARN
> or Mesos deployment, but I have tested the standalone deploy on a CDH4
> cluster and it works very well. There's also a bootstrap action for
> Spark/Shark on Elastic MapReduce (
> http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923) so you
> can run them on AWS together (which opens up interesting possibilities to
> use within AWS Data Pipelines).

Yes. I found some runtime dependency clashes with HBase in CDH4, which
require tedious filtering, and some Hive compatibility issues in Shark, but
nothing operationally blocking. The YARN option is not where it should be
yet, imo.

> The main point of interest in this context is that I intend to build a
> minimal first-cut machine learning library for Spark. This is likely to
> involve porting / using parts of Mahout where it makes sense (or at the
> very least taking major cues from Mahout implementations, as well as other
> ML libraries). On the Java side it's also highly probable it would use
> mahout-math (although there are some other options).

My thoughts exactly.

>
> I hope to start getting into this properly over the next couple of months.
> So if anyone is interested in collaborating, I would very much welcome
> help, discussion, idea-bouncing, etc.

I am interested. Not immediately, but in a couple of months I will need to
get onto some collaborative stuff. I'd appreciate a GitHub link if you
decide to publish anything.

D

>
> Nick
>
> On Wed, Mar 13, 2013 at 4:04 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > Hi Josh,
> >
> >
> > On Tue, Mar 12, 2013 at 3:05 PM, Josh Wills <[email protected]> wrote:
> >
> > > Hey Mahout Devs,
> > >
> > >
> > > First, I wanted to say that I think that there are lots of problems that
> > > can be handled well in MapReduce (the recent k-means streaming stuff
> > > being a prime example), even if they could be performed even faster
> > > using an in-memory model.
> >
> >
> > Yeah. The stuck point for me is PageRank-ish finding of stationary
> > distributions, and the extremely popular ALS-based stuff. We beat the
> > heck out of it a year ago, and Sebastian conclusively stated that Giraph
> > ALS knocks the socks off the MR version. Add to that a bisect search for
> > a good regularization, and the number of iterations needed multiplies --
> > that's one thing we actually eventually dropped completely, as far as I
> > can remember, for that very reason. Parallel search is also possible,
> > but it would require far more crunching than the bisect one.
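> > To illustrate the cost multiplication mentioned here, a rough sketch:
> > each probe of the regularization parameter costs a full ALS training run,
> > so a bisect search over lambda multiplies the total iteration count. The
> > error curve, iteration counts, and class/method names below are all made
> > up for illustration; only the search structure is the point.

```java
// Hypothetical sketch: bisect search for a "good" ALS regularization.
// Every probe of lambda requires a FULL ALS run of ITERS_PER_RUN
// iterations, so the search multiplies the total iterations needed.
public class RegularizationBisect {
    static int alsRuns = 0;              // counts full ALS trainings
    static final int ITERS_PER_RUN = 10; // assumed iterations per ALS run

    // Stand-in for "train ALS with this lambda, return held-out error".
    // The quadratic shape is invented; real curves are only roughly unimodal.
    static double validationError(double lambda) {
        alsRuns++;
        return Math.pow(lambda - 0.065, 2) + 0.92;
    }

    // Bisect on the sign of the finite-difference slope of the error curve:
    // positive slope means the minimizer lies to the left of mid.
    static double bisectLambda(double lo, double hi, double tol) {
        while (hi - lo > tol) {
            double mid = (lo + hi) / 2, eps = tol / 4;
            double slope = validationError(mid + eps) - validationError(mid - eps);
            if (slope > 0) hi = mid; else lo = mid;
        }
        return (lo + hi) / 2;
    }

    public static void main(String[] args) {
        double best = bisectLambda(0.0, 1.0, 1e-3);
        System.out.printf("best lambda ~ %.4f after %d ALS runs (~%d iterations total)%n",
                best, alsRuns, alsRuns * ITERS_PER_RUN);
    }
}
```

> > With ~10 halvings and two probes per step, that is already ~20 full ALS
> > runs, which is exactly why the per-iteration cost of the engine dominates.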
> >
> > Even with Shark I am able to do ETL-like things in under 5 s on a dataset
> > where the same Hive query takes minutes to complete. There's absolutely
> > no discussion about it: for some of our exploratory requirements MR just
> > doesn't cut it, especially for problems with a high number of steps, a
> > well-defined blocking structure, and interconnectedness.
> >
> > Also, batch stuff is 95% ETL / variable prep. There's absolutely no
> > argument that batch systems such as Crunch and Cascading and Pig will
> > stay absolutely valuable in that problem space.
> >
> > If we throw away those 95%, however, and keep the relatively condensed
> > ML problems, then at least 50% of real-life problems land at medium-size,
> > cpu-bound datasets with high interconnectedness and a high number of
> > supersteps. Just as Ted mentioned, I found that to be increasingly,
> > pragmatically, the case. I had to eat a lot of dung with SSVD in order
> > to manage the problem of interconnectedness and fit it into small
> > information pieces so that side-load is not apparent at scale, but even
> > that approach still has it, and will undoubtedly run more efficiently
> > once combined with scatter/distributed-memory techniques.
> >
> > I found some operational problems with Spark (notably, task cleanup in
> > workers), although it is very likely me somehow being stupid or running
> > a lot of code that breaks. But I surprisingly found myself swearing a
> > lot less than I did about some of the Hadoop ecosystem products
> > operationally. I have still not had a production trial, though, to be
> > honest. We'll see.
> >
> >
> > >
> > > I'm wondering if we could solve both problems by creating a wrapper
> > > API that would look a lot like the Spark RDD API and then creating
> > > implementations of that API via Spark/Giraph/Crunch (truly shameless
> > > promotion: http://crunch.apache.org/ ). That way, the same model could
> > > be run in-memory against Spark or in batch via a series of MapReduce
> > > jobs (or a BSP job, or a Tez pipeline, or whatever execution framework
> > > is written next week.)
> >
> >
> > I considered similar things before. One problem with this approach, the
> > way I see it, is that usually you have to create some sort of format
> > bridging, or at the very least persist the results between changing
> > approach gears.
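> > For concreteness, the wrapper idea above could be sketched, very
> > roughly, as one small collection interface that a model is written
> > against, with pluggable execution backends. Only an in-memory backend is
> > shown; a Spark-, Giraph-, or Crunch-backed one would implement the same
> > interface over partitioned data. All names here are illustrative, not
> > from any real library.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical engine-agnostic distributed collection of "things"
// (e.g., Mahout Vectors). Models depend only on this interface.
interface DCollection<T> {
    <R> DCollection<R> map(Function<T, R> fn);
    List<T> collect(); // materialize results back to the driver
}

// Trivial in-memory backend, standing in for what a Spark- or
// Crunch-backed implementation would do over partitioned data.
class LocalCollection<T> implements DCollection<T> {
    private final List<T> data;

    LocalCollection(List<T> data) { this.data = data; }

    public <R> DCollection<R> map(Function<T, R> fn) {
        return new LocalCollection<>(
                data.stream().map(fn).collect(Collectors.toList()));
    }

    public List<T> collect() { return data; }
}
```

> > A model written as `coll.map(v -> scale(v)).collect()` would then run
> > unchanged on whichever backend constructed `coll` -- which is the point
> > of the proposal, format bridging between backends notwithstanding.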
> >
> > One of the nicest things about Spark is that datasets are already
> > partitioned in memory, not going anywhere, and ready to launch either a
> > superstep or a mapper the next nanosecond without any additional ado.
> > They are giving it all in the same giftwrap. And yes, there's this
> > scatter thing vs. the forced-sorting thing, but oh well. And yes, they
> > are just as operationally compatible with Hadoop stuff as regular MR:
> > same HDFS location affinity. I just recently finished a parallel RDD
> > pulling data simultaneously from an HBase coprocessor-based spatial
> > scan, with the whole region-server location-affinity thing; very easy
> > to integrate.
> >
> > (well, easy to integrate on the Spark side; the endpoint coprocessor
> > API on pre-0.96 HBase woefully lacks streaming, so.. but even then... )
> >
> >
> > > The main virtue of Crunch in this regard is that the data model
> > > is very similar to Spark's (truth be told, I used the Spark API as a
> > > reference when I was originally creating Crunch's Scala API,
> > > "Scrunch") -- the whole idea of distributed collections of "things,"
> > > where things can be anything you want them to be (e.g., Mahout
> > > Vectors.)
> > >
> >
> > My opinion is that the Crunch API is no better or worse than the RDD
> > API, but it lacks bulk parallel operations (as it stands today,
> > obviously). But an argument about subset similarity is far from an
> > argument about superiority. FWIW, I believe the Crunch API is far
> > better than Cascading for integrating with Mahout's vectors (and better
> > suited for custom type integration, such as R types), which is why I
> > was, and still am, interested in that direction; but Spark adoption is
> > at the very least just as easy.
> >
> > > I don't have an opinion on the structure of such an effort (via the
> > > incubator or otherwise), but I thought I would throw the idea out
> > > there, as it's something I would definitely like to be involved in.
> >
> > I support extending Mahout to any end (on the side first, with
> > module-level isolation). In the end, everything that is added either
> > wins by being used or stagnates, but without fresh attempts it is hard
> > to expect any results at all. Even the sheer fact of this discussion's
> > existence is a testament that Mahout may need to try something new
> > operationally to maintain and extend its acceptance. But I think the
> > agreement has always been to maintain high entry criteria for new
> > stuff with good usage prospects. Just adding stuff for the sake of
> > adding stuff to the mainstream has, in the past, tended to dissolve
> > the project's cohesiveness.
> >
> > -d
> >
