Hi Josh,

On Tue, Mar 12, 2013 at 3:05 PM, Josh Wills <[email protected]> wrote:

> Hey Mahout Devs,
>
>
> First, I wanted to say that I think that there are lots of problems that
> can be handled well in MapReduce (the recent k-means streaming stuff being
> a prime example), even if they could be performed even faster using an
> in-memory model.


Yeah. The sticking point for me is PageRank-ish computation of stationary
distributions and the extremely popular ALS-based stuff. We beat the heck
out of this a year ago, and Sebastian conclusively showed that Giraph ALS
knocks the socks off the MR version. Add to that a bisection search for a
good regularization value, and the number of iterations needed multiplies --
that's one thing we actually dropped completely in the end, as far as I can
remember, for that very reason. A parallel search is also possible, but it
would require far more crunching than the bisection one.
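(To put a number on why the iterations multiply: each probe of the
regularization value costs a full ALS run, and a bisection-style search
needs a few dozen probes. A toy sketch -- the error curve, the optimum,
and all the numbers are invented stand-ins for real ALS runs:)

```java
public class RegularizationSearch {

    /** Hypothetical validation error; in reality each call is a full ALS run. */
    static double validationError(double lambda) {
        double assumedOptimum = 0.065; // invented for illustration only
        return Math.pow(lambda - assumedOptimum, 2) + 0.42;
    }

    /** Bisection-style (ternary) search over a unimodal error curve. */
    static double searchLambda(double lo, double hi, double tol) {
        while (hi - lo > tol) {
            double m1 = lo + (hi - lo) / 3.0;
            double m2 = hi - (hi - lo) / 3.0;
            if (validationError(m1) < validationError(m2)) {
                hi = m2; // minimum lies left of m2
            } else {
                lo = m1; // minimum lies right of m1
            }
        }
        return (lo + hi) / 2.0;
    }

    public static void main(String[] args) {
        // ~23 probes to shrink [0, 1] below 1e-4 -- i.e. ~23 ALS runs,
        // each itself dozens of iterations.
        System.out.println("chosen lambda ~ " + searchLambda(0.0, 1.0, 1e-4));
    }
}
```

Each of those probes is a full training run, which is exactly where the
iteration count blows up under MR.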

Even with Shark I am able to do ETL-like things in under 5s on a dataset
where the same Hive query takes minutes to complete. There's absolutely no
question that for some of our exploratory requirements MR just doesn't cut
it -- especially for problems with a high number of steps, a well-defined
blocking structure, and a lot of interconnectedness.

Also: batch stuff is 95% ETL / variable prep. There's no argument that
batch systems such as Crunch, Cascading, and Pig will remain absolutely
valuable in that problem space.

If we throw away those 95%, however, and keep the relatively condensed ML
problems, then for at least 50% of real-life problems we land at
medium-size, cpu-bound datasets with high interconnectedness and a high
number of supersteps. Just as Ted mentioned, I find this to be increasingly
the case in practice. I had to eat a lot of dung with SSVD in order to
manage the interconnectedness problem and fit it into small pieces of
information so that the side-load is not apparent at scale, but even that
approach still has it and will undoubtedly run more efficiently once
combined with scatter/distributed-memory techniques.

I found some operational problems with Spark (notably, task cleanup in
workers), although it is very likely me somehow being stupid or running a
lot of code that breaks. But I surprisingly found myself swearing a lot
less than I did about some Hadoop ecosystem products operationally. To be
honest, I have not put it through a production trial yet. We'll see.


>
> I'm wondering if we could solve both problems by creating a wrapper API
> that would look a lot like the Spark RDD API and then created
> implementations of that API via Spark/Giraph/Crunch (truly shameless
> promotion: http://crunch.apache.org/ ). That way, the same model could be
> run in-memory against Spark or in batch via a series of MapReduce jobs (or
> a BSP job, or a Tez pipeline, or whatever execution framework is written
> next week.)


I considered similar things before. One problem with this approach, the
way I see it, is that you usually have to create some sort of format
bridging, or at the very least persist the results when changing approach
gears.
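For the record, here's roughly how I picture such a wrapper -- everything
below is invented for illustration, not anyone's actual API. A minimal
RDD-like interface plus one trivial in-memory backend; a Spark backend
would delegate map to RDD.map, a Crunch/MR backend would plan jobs, and
collect() is exactly the boundary where the format bridging / persisting
between gears would happen:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical RDD-like interface that Spark/Crunch/Giraph bindings
// could implement. Names are made up for the sake of the sketch.
interface DCollection<T> {
    <U> DCollection<U> map(Function<T, U> fn);
    List<T> collect(); // materialize -- where format bridging would bite
}

// Trivial in-memory backend, standing in for a real engine binding.
class LocalCollection<T> implements DCollection<T> {
    private final List<T> data;

    LocalCollection(List<T> data) { this.data = data; }

    public <U> DCollection<U> map(Function<T, U> fn) {
        return new LocalCollection<>(
            data.stream().map(fn).collect(Collectors.toList()));
    }

    public List<T> collect() { return data; }
}
```

The rub is that a real MR backend has to spill to HDFS at every
materialization point, while the Spark one keeps the partitions in memory
-- which is the asymmetry I'm griping about below.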

One of the nicest things about Spark is that datasets are already
partitioned in memory, not going anywhere, and ready to launch either a
superstep or a mapper the next nanosecond without any additional ado. They
give it all in the same gift wrap. And yes, there's the scatter thing vs.
the forced-sorting thing, but oh well. And yes, they are just as
operationally compatible with Hadoop stuff as regular MR -- same HDFS
location affinity. I just recently finished a parallel RDD pulling data
simultaneously from an HBase coprocessor-based spatial scan with the whole
region-server location affinity thing; it was very easy to integrate.

(Well, easy to integrate on the Spark side; the endpoint coprocessor API
in pre-0.96 HBase is woefully lacking in streaming, so... but even then...)


> The main virtue of Crunch in this regard is that the data model
> is very similar to Spark's (truth be told, I used the Spark API as a
> reference when I was originally creating Crunch's Scala API, "Scrunch")--
> the whole idea of distributed collections of "things," where things can be
> anything you want them to be (e.g., Mahout Vectors.)
>

My opinion is that Crunch is no better or worse than the RDD API, but it
lacks bulk parallel operations (as it stands today, obviously). An argument
about subset similarity is far from an argument about superiority, though.
FWIW, I believe the Crunch API is far better than Cascading for integrating
with Mahout's vectors (and better suited for custom type integration, such
as R types), which is why I was, and still am, interested in that
direction; but Spark adoption is at the very least just as easy.

> I don't have an opinion on the structure of such an effort (via the
> incubator or otherwise), but I thought I would throw the idea out there,
> as
> it's something I would definitely like to be involved in.

I support extending Mahout to any end (on the side first, with
module-level isolation). In the end, everything that is added either wins
by being used or it stagnates, but without fresh attempts it is hard to
expect any results at all. Even the sheer fact that this discussion exists
is a testament that Mahout may need to try something new operationally to
maintain and extend its acceptance. But I think the agreement has always
been to maintain high entry criteria for new stuff with good usage
prospects. Just adding things for the sake of adding things to the
mainstream has tended to dissolve the project's cohesiveness in the past.

-d
