Re: Mahout DSL vs Spark

Sebastian Schelter Sun, 27 Apr 2014 23:12:25 -0700

Anand,

I'd also love to see work on a cleaner separation between the DSL andSpark. Another thing that should be tackled in the current code is thatthe SparkContext has to be present as implicit val in some methods.

Making the DSL run on different systems will be a powerful feature thatwill make Mahout unique and attractive to a lot of users, as it doesn'tenforce a lock-in to a particular system. I've talked to a companyrecently that exactly had this requirement, they decided against usingSpark, but would still be highly interested in running new Mahoutrecommenders built using the DSL.


--sebastian


On 04/28/2014 05:39 AM, Anand Avati wrote:

On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <[email protected]> wrote:




On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote:

Hi Ted, Dmitry,
Background: I am exploring the feasibility of providing H2O distributed
"backend" to the DSL.


Very cool. that's actually was one of my initial proposals on how to
approach this. Got pushed back on this though.


We are exploring various means of integration. The Jira mentioned providing
Matrix and Vector implementations as an initial exploration. That task by
itself had a lot of value in terms of reconciling some ground level issues
(build/mvn compatibility, highlighting some classloader related challenges
etc. on the H2O side.) Plugging behind a common DSL makes sense, though
there may be value in other points of integration too, to exploit H2O's
strengths.

At a high level it appears that implementing physical operators for
DrmLike over H2O does not seem extremely challenging. All the operators in
the DSL seem to have at least an approximate equivalent in H2O's own
(R-like) DSL, and wiring one operator with another's implementation seems
like a tractable problem.


It should be tractable, sure, even for map reduce. The question is whether
there's enough diversity to do certain optimizations in a certain way. E.g.
if two matrices are identically partitioned, then do map-side zip instead
of actual parallel join etc.

But it should be tractable, indeed.



Yes, H2O has ways to do such things - a single map/reduce task on two
matrices "side by side" which are similarly partitioned (i.e, sharing the
same VectorGroup in H2O terminology)

The reason I write, is to better understand the split between the Mahout

DSL and Spark (both current and future). As of today, the DSL seems to be
pretty tightly coupled with Spark.

E.g:

- DSSVD.scala imports o.a.spark.storage.StorageLevel


This is a known thing, I think i noted it somewhere in jira. That, and rdd
property of CheckpointedDRM. This needs to be abstracted away.

- drm.plan.CheckpointAction: the result of exec() and checkpoint() is
DrmRddInput (instead of, say, DrmLike)


CheckpointAction is part of physical layer. This is something that would
have to be completely re-written for a new engine. This is the "plugin"
api, but it is never user-facing (logical plan facing).


It somehow felt that the optimizer was logical-ish. Do you mean the
optimizations in CheckpointAction are specific to Spark and cannot be
inherited in general to other backends (not that I think that is wrong)?

Firstly, I don't think I am presenting some new revelation you guys don't
already know - I'm sure you know that the logical vs physical "split" in
the DSL is not absolute (yet).


Aha. Exactly


That being said, I would like to understand if there are plans, or
efforts already underway to make the DSL (i.e how DSSVD would be written)
and the logical layer (i.e drm.plan.* optimizer etc) more "pure" and move
the Spark specific code entirely into the physical domain. I recall Dmitry
mentioning that a new engine other than Spark was also being planned,
therefore I deduce some thought for such "purification" has already been
applied.


Aha. The hope is for Stratosphere. But there are few items that need to be
done by Stratosphere folks before we can leverage it fully. Or, let's say,
leverage it much better than we otherwise could. Makes sense to wait a bit.


It would be nice to see changes approximately like:

Rename ./spark => ./dsl
Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
./dsl/src/main/scala/org/apache/mahout/dsl
Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
./dsl/main/scala/org/apache/mahout/dsl/spark-backend


i was thinking along the lines factoring out public traits and logical
operators (DRMLike etc.)  out of spark module into independent module
without particular engine dependencies. Exactly. It just hasn't come to
that yet.

along with appropriately renaming packages and imports, and confining
references to RDD and SparkContext completely within spark-backend.

I think such a clean split would be necessary to introduce more backend
engines. If no efforts are already underway, I would be glad to take on the
DSL "purification" task.


i think you got very close to my thinking about further steps here. Like i
said, i was just idling in wait for something like Stratosphere to become
closer to our orbit.


OK, I think there is reasonable alignment on the goal. But you were not
clear on whether you are going to be doing the purification split in the
near future? or is that still an "unassigned task" which I can pick up?

Avati

Re: Mahout DSL vs Spark

Reply via email to