Dmitriy,

I share a lot of the concerns you expressed here. I hear more complaints about Mahout being too inaccessible and too hard to customize for specific use cases and inputs than complaints about it being too slow. I also concur with your analysis that a clear and accessible programming model is what drives Spark's popularity.

I'm also not a fan of sacrificing a programming model for performance; I consider this the main drawback of GraphLab. It's superfast for a certain set of problems, but it constrains you to a vertex-centric programming model, into which a lot of things hardly fit.



On 03/14/2014 03:21 PM, Dmitriy Lyubimov wrote:
I think that the proposal under discussion involves adding a dependency on
a maven released h2o artifact plus a contribution of Mahout translation
layers.  These layers would give a sub-class of Matrix (and Vector) which
allow direct control over life span across multiple jobs but would
otherwise behave like their in-memory counter-parts.

Well I suppose that means they have to live in some processes which are not
processes I already have. And they have to be managed. So this is not just
an in-core subsystem. Sounds like a new back to me.


In Hadoop, every iteration must be scheduled as a separate job, rereads
invariant data and materializes its result to hdfs. Therefore, iterative
programs on Hadoop are an order of magnitude slower than on systems that
have dedicated support for iterations.
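
To make the cost concrete, here is a minimal toy sketch in Python (not Hadoop or Spark code; `Storage`, `update` and both runner functions are hypothetical names made up for illustration). It only counts how often the invariant input is read: once per iteration in the job-per-iteration style, once in total when the input is kept in memory.

```python
class Storage:
    """Stand-in for a distributed file system; counts full reads."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._data)


def update(points, estimate):
    """One iteration: move the estimate halfway toward the mean of the points."""
    mean = sum(points) / len(points)
    return estimate + 0.5 * (mean - estimate)


def run_job_per_iteration(storage, iterations):
    """Every iteration is its own 'job' and rereads the invariant data."""
    estimate = 0.0
    for _ in range(iterations):
        points = storage.load()
        estimate = update(points, estimate)
    return estimate


def run_with_cached_input(storage, iterations):
    """The invariant data is read once and kept in memory across iterations."""
    points = storage.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = update(points, estimate)
    return estimate
```

Both functions return the same estimate; the only difference is the number of reads, which is the part that dominates on real clusters.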

Does h2o help here or would we need to incorporate another system for such
tasks?


H2o helps here in a couple of different ways.

The first and foremost is that primitive operations are easy.
Additionally, data elements can survive a single program's execution.  This
means that programs can be executed one after another to get composite
effects.  This is astonishingly fast ... more along the speeds one would
expect from a single processor program.

I think the problem here is that the authors keep comparing these
techniques to the slowest model available, which is Hadoop.

But this is the exact execution model of Spark. You get stuff repeatedly
executed on in-memory partitions and get approximately the speed of
iterative execution.  I wouldn't describe it as astonishing, though,
because indeed it is as fast as you can get things done in memory, no
faster, no slower. That's, for example, the reason why my linalg optimizer
does not hesitate to compute exact matrix geometry lazily if it is not
known, for optimization purposes, because the answer will be back in 40 to
200 ms, assuming adequate RAM allocation. I have been using these paradigms
for more than a year now. This is all good stuff. I would not use the word
astonishing, but sensible, yes. The main concern is whether the programming
model is being sacrificed just to do sensible things here.
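
To illustrate what "compute geometry lazily if not known" means, here is a minimal Python sketch. It is not the actual optimizer code; `LazyMatrix`, `row_source` and the `scans` counter are hypothetical names for illustration. The dimensions are only computed, with one pass over the rows, the first time somebody asks for them, and are cached afterwards.

```python
class LazyMatrix:
    """Row-wise matrix whose dimensions are computed only on first use."""

    def __init__(self, row_source, nrow=None, ncol=None):
        # row_source: hypothetical callable returning an iterable of rows.
        self._row_source = row_source
        self._nrow = nrow
        self._ncol = ncol
        self.scans = 0  # number of full passes spent on geometry

    def _compute_geometry(self):
        self.scans += 1
        nrow, ncol = 0, 0
        for row in self._row_source():
            nrow += 1
            ncol = max(ncol, len(row))
        self._nrow, self._ncol = nrow, ncol

    @property
    def nrow(self):
        if self._nrow is None:
            self._compute_geometry()
        return self._nrow

    @property
    def ncol(self):
        if self._ncol is None:
            self._compute_geometry()
        return self._ncol
```

If the data already sits in memory, that one pass is cheap, which is why an optimizer can afford to ask for it.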



(2) Efficient join implementations

If we look at a lot of Mahout's algorithm implementations with a database
hat on, then we see lots of handcoded joins in our codebase, because Hadoop
does not bring join primitives. This has lots of drawbacks, e.g. it
complicates the codebase and leads to hardcoded join strategies that bake
certain assumptions into the code (e.g. ALS uses a broadcast-join which
assumes that one side fits into memory on each machine, RecommenderJob uses
a repartition-join which is scalable but very slow for small inputs,...).
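
For readers less familiar with the two strategies, here is a minimal single-process sketch in Python. It is not Mahout, Hadoop or Spark code; the function names and the list-of-partitions representation are assumptions made for illustration. The broadcast variant copies the whole small side to every partition; the repartition variant shuffles both sides by key hash and joins bucket by bucket.

```python
from collections import defaultdict


def broadcast_join(big_partitions, small_records):
    """Ship the whole small side to every partition; the big side never moves."""
    lookup = defaultdict(list)
    for key, value in small_records:
        lookup[key].append(value)
    out = []
    for partition in big_partitions:
        for key, value in partition:
            for other in lookup.get(key, []):
                out.append((key, value, other))
    return out


def repartition_join(left_partitions, right_partitions, num_partitions):
    """Shuffle both sides by key hash, then join each bucket locally."""
    left_buckets = [defaultdict(list) for _ in range(num_partitions)]
    right_buckets = [defaultdict(list) for _ in range(num_partitions)]
    for partition in left_partitions:
        for key, value in partition:
            left_buckets[hash(key) % num_partitions][key].append(value)
    for partition in right_partitions:
        for key, value in partition:
            right_buckets[hash(key) % num_partitions][key].append(value)
    out = []
    for left, right in zip(left_buckets, right_buckets):
        for key, left_values in left.items():
            for left_value in left_values:
                for right_value in right.get(key, []):
                    out.append((key, left_value, right_value))
    return out
```

Both produce the same set of joined records; what differs is which side moves over the network and what has to fit in memory, and that is exactly the assumption that gets baked in when the join is handcoded.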


+1

I think that h2o provides this but do not know in detail how.  I do know
that many of the algorithms already coded make use of matrix multiplication
which is essentially a join operation.

Essentially a join? The Spark module optimizer picks among at least 3
implementations: zip+combine, block-wise cartesian and finally, yes,
join+combine. It depends on orientation and the earlier operators in the
pipeline. That's exactly my point about flexibility of the programming model
from the optimizer's point of view.
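
To show just the join+combine variant, here is a minimal Python sketch over sparse (row, col, value) triples. It is an illustration only, not the Spark bindings code, and `matmul_join_combine` is a hypothetical name: entries of A and B are joined on the shared inner index k, and the partial products are then combined (summed) per output cell (i, j).

```python
from collections import defaultdict


def matmul_join_combine(a_entries, b_entries):
    """Multiply sparse matrices given as (row, col, value) triples."""
    # Index B by its row index, which is the join key k.
    b_by_row = defaultdict(list)
    for k, j, b_val in b_entries:
        b_by_row[k].append((j, b_val))

    product = defaultdict(float)
    for i, k, a_val in a_entries:
        for j, b_val in b_by_row.get(k, []):   # join on k
            product[(i, j)] += a_val * b_val   # combine per output cell
    return dict(product)
```

The zip+combine and block-wise cartesian variants would compute the same product with a different data movement pattern, which is why the choice is better left to an optimizer than hardcoded.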


Obviously, I'd love to get rid of handcoded joins and implement ML
algorithms (which is hard enough on its own). Other systems help with this
already. Spark, for example, offers broadcast and repartition-join
primitives, Stratosphere has a join primitive and an optimizer that
automatically decides which join strategy to use, as well as a highly
optimized hybrid hashjoin implementation that can gracefully go out-of-core
under memory pressure.
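
As a rough idea of what "automatically decides which join strategy to use" can look like in its simplest form, here is a toy size-based rule in Python. This is an assumption for illustration only and not Stratosphere's actual optimizer; `choose_join_strategy` and its parameters are hypothetical.

```python
def choose_join_strategy(left_bytes, right_bytes, memory_budget_bytes):
    """Pick a join strategy from estimated input sizes.

    Broadcast the smaller side if it fits into the per-machine memory
    budget; otherwise fall back to repartitioning both sides.
    """
    smaller = min(left_bytes, right_bytes)
    if smaller <= memory_budget_bytes:
        return "broadcast"
    return "repartition"
```

A real cost-based optimizer would also weigh network cost, existing partitioning and sort orders, but even this much removes the hardcoded assumption from the algorithm code.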


When you get into the realm of things on this level of sophistication, I
think that you have found the boundary where alternative foundations like
Spark and Stratosphere are better than h2o.  The novelty with h2o is the
hypothesis that a very large fraction of interesting ML algorithms can be
implemented without this power.  So far, this seems correct.

Again, this is largely along the lines of "let's make a library of a few
hand-optimized things". Which is noble, but -- I would argue -- not
ambitious enough. Most of the distributed ML projects do just that. We
should perhaps think about what could be a differentiating factor
for us.

Not that we should not care about performance. It should be, of course,
*sensible*. (Our MR code base of course does not give us that; as you said,
jumping off the MR wagon is not even a question).

If you can forgive me for drawing parallels here, it's a difference between
something like Weka and R. Collection vs. platform _and_ collection induced
by platform. Platform of course also positively feeds into the speed of
collection growth directly.

When I use R, I don't have code consisting only of algorithm calls. That is,
yes, it does off-the-shelf things now and then, but that is far from being
the only thing it is doing. 95% of it is simple feature
massaging. I place no value in R for providing GLM for me. Gosh, this
particular offering is available virtually anywhere these days.

But I do place value in it for doing custom feature prep and, for
example, for being able to get 100 grad students to try their own k-means
implementation in seconds.

Why?

There has been a lot of talk here about building community and
contributions etc. Platform is what builds it, most directly and amazingly.
I would go out on a limb here and say that Spark and MLlib are experiencing
explosive growth of contributions not because they can do things with
in-memory datasets (which is important but, like I said, has long been
viewed as no more than just sensible), but because of the clarity of their
programming model. I think we have seen very solid evidence that clarity
and richness of the programming model is the thing that attracts communities.

If we grade roughly (very roughly!) what we have today, I can easily argue
that the acceptance levels follow the programming model very closely. E.g.
if I try to sort projects with distributed programming models by (my
subjectively perceived) popularity, from bottom to top:

********

Hadoop MapReduce -- ok i don't even know how to organize the critique here,
too long of a list, almost nobody (but Mahout) does these things this way
today. Certainly, none of my last 2 employers did.

Hive -- SQL-like with severely constrained general programming language
capabilities, not conducive to batches. Pretty much limited to ad-hoc
exploration.

Pig -- a bit better, can write batches, but extra functionality mixins
(UDFs) are still a royal pain

Cascading -- even easier, rich primitives, easy batches, some manual
optimization of physical plan elements. One of the big cons is the
limitation of a rigid dataset tuple structure.

FlumeJava (Crunch in apache world) -- even better, but java closures are
just plain ugly, zero "scriptability". Its community has been hurt a little
bit because of the fact that it was a bit late to the show compared to
others (e.g. cascading), but it leveled off quickly.

Scala bindings for Cascading (Scalding) and FlumeJava -- better, hell, well
better on the closure and FP front! But still, not being native to Scala
from the get-go creates some miniature problems there.

Spark -- I think it is fair to say, the current community "king" above all
of those -- all the aforementioned platform model pains are eliminated,
although on the performance side I think there are still some pockets for
improvement on the cost-based optimization side of things.

Stratosphere might be more interesting in this department, but I am not
sure at this point if that necessarily will translate into performance
benefits for ML.

********

The first few things use the same computing model underneath and
have roughly the same performance. Yet there's a clear
variation in community and acceptance.

In the ML world, we are seeing approximately the same thing. The clearer the
programming model and the easier the integration into the process, the wider
the acceptance. I can probably argue pretty successfully that the most
performant ML "thing" as it stands is GraphLab. And it is pretty
comprehensive in problem coverage (I think it covers e.g. recommender
concerns better than h2o and Mahout together). But I can also argue pretty
successfully that it is being rejected a lot of the time for being just
a collection (which, in addition, is hard to call from the JVM, i.e.
integration again). It is actually so bad that people in my company would
rather go back to 20 snow-wired R servers than think of even entertaining
an architecture including a GraphLab component. (Yes, the variance of this
sample is as high as it gets, just saying what I hear).

So as a general guideline to solve the current ills, it would stand to
reason to adopt platform priority and algorithm collection as a function of
such platform, rather than collection as a function of few dedicated
efforts. Yes -- it has to be *sensibly* performant -- but this does not
have to be mostly a concern of the code in this project directly. Rather,
it has to be a concern of the backs (i.e. dependencies) and our in-core
support.

Our pathological fear of being a performance scapegoat totally obscures the
fact that performance is mostly a function of the back and that we were
riding on the wrong back for a long time. As long as we don't cling to a
particular back, it shouldn't be a problem. What would one rather accept:
being initially 5x slower than GraphLab (but on par with MLlib) but beating
these on community support, or being on par but anemic in community? If the
h2o platform feels that performance is important enough to sacrifice the
programming model, why do they feel the need to join an Apache project? After
all, they have been an open project for a long time already and have built
their own community, big or small. Spark has just now become a top-level
Apache project, and joined the Apache incubator a mere 2 months ago, and did
not have any trouble attracting community outside Apache at all. Stratosphere
is not even in Apache. Similarly, did it help Mahout to be in Apache to get
anywhere close in community measurement to these? So this totally refutes
the argument that one has to be an Apache project to get its exclusive
qualities highlighted. Perhaps in the end it is more about the importance of
the qualities to the community and the quality of contributions.

A lot of this platform and programming model priority is probably easier
said than done, but some of the linalg and data frame things are ridiculously
easy in terms of the amount of effort. If I could do a linalg optimizer with
bindings for Spark in 2 nights a month, the same can be done for
multiple backs and data frames in a jiffy. Well, the back should have a
clear programming model of course as a prerequisite. Which brings us back
to the issue of richness of distributed primitives.

