On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <[email protected]> wrote:

>
>
>
> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote:
>
>> Hi Ted, Dmitriy,
>> Background: I am exploring the feasibility of providing H2O distributed
>> "backend" to the DSL.
>>
>
> Very cool. That was actually one of my initial proposals for how to
> approach this. Got pushed back on it, though.
>

We are exploring various means of integration. The Jira mentioned providing
Matrix and Vector implementations as an initial exploration. That task by
itself had a lot of value in terms of reconciling some ground-level issues
(build/mvn compatibility, highlighting some classloader-related challenges
on the H2O side, etc.). Plugging in behind a common DSL makes sense, though
there may be value in other points of integration too, to exploit H2O's
strengths.


>
>
>> At a high level it appears that implementing physical operators for
>> DrmLike over H2O does not seem extremely challenging. All the operators in
>> the DSL seem to have at least an approximate equivalent in H2O's own
>> (R-like) DSL, and wiring one operator with another's implementation seems
>> like a tractable problem.
>>
>
> It should be tractable, sure, even for map reduce. The question is whether
> there's enough diversity to do certain optimizations in a certain way.
> E.g., if two matrices are identically partitioned, do a map-side zip
> instead of an actual parallel join, etc.
>
> But it should be tractable, indeed.
>


Yes, H2O has ways to do such things: a single map/reduce task over two
matrices "side by side" which are similarly partitioned (i.e., sharing the
same VectorGroup, in H2O terminology).
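The map-side-zip idea discussed above can be sketched in plain Scala, with toy in-memory "partitions" standing in for distributed ones. None of this is actual Spark or H2O API; `ZipVsJoin`, `Partition`, and both method names are assumptions for illustration:

```scala
// Toy model of the optimization: a distributed matrix is a Seq of
// "partitions", each holding rows keyed by row index. (Assumed toy types,
// not Spark RDDs or H2O Frames.)
object ZipVsJoin {
  type Partition = Map[Int, Vector[Double]]

  // Identically partitioned case: partitions pair up positionally, so an
  // elementwise A + B needs no shuffle ("map-side zip").
  def addZip(a: Seq[Partition], b: Seq[Partition]): Seq[Partition] =
    a.zip(b).map { case (pa, pb) =>
      pa.map { case (row, va) =>
        row -> va.zip(pb(row)).map { case (x, y) => x + y }
      }
    }

  // General case: gather B's rows by key first (this models the parallel
  // join / shuffle), then add.
  def addJoin(a: Seq[Partition], b: Seq[Partition]): Seq[Partition] = {
    val bRows = b.flatten.toMap
    a.map(_.map { case (row, va) =>
      row -> va.zip(bRows(row)).map { case (x, y) => x + y }
    })
  }
}
```

Both paths compute the same sum; the zip path simply avoids re-keying rows, which is the saving an optimizer could exploit when it can prove the partitionings match.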



>> The reason I write is to better understand the split between the Mahout
>> DSL and Spark (both current and future). As of today, the DSL seems to be
>> pretty tightly coupled with Spark.
>>
>> E.g:
>>
>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>
>
> This is a known thing; I think I noted it somewhere in the Jira. That,
> and the rdd property of CheckpointedDRM. These need to be abstracted away.
>
>
>> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
>> DrmRddInput (instead of, say, DrmLike)
>>
>
> CheckpointAction is part of the physical layer. This is something that
> would have to be completely re-written for a new engine. This is the
> "plugin" API, but it is never user-facing (it only faces the logical plan).
>

It somehow felt that the optimizer was logical-ish. Do you mean the
optimizations in CheckpointAction are specific to Spark and cannot be
inherited in general by other backends (not that I think that would be
wrong)?
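On the earlier point about DSSVD importing o.a.spark.storage.StorageLevel: one way to abstract that away is an engine-neutral cache hint that each backend translates into its own caching primitive. This is only a sketch; `CacheHint`, `CheckpointSupport`, and `SparkCacheMapping` are assumed names, not existing Mahout API:

```scala
// Engine-neutral cache hints the DSL could expose instead of Spark's
// StorageLevel. All names here are assumptions for illustration.
object CacheHint extends Enumeration {
  val NONE, MEMORY_ONLY, MEMORY_AND_DISK = Value
}

// What algorithm code like DSSVD would compile against - no Spark imports.
trait CheckpointSupport {
  def checkpoint(hint: CacheHint.Value): this.type
}

// Each physical backend translates the hint into its own caching primitive;
// shown here only as the Spark StorageLevel constant's name.
object SparkCacheMapping {
  def toStorageLevelName(hint: CacheHint.Value): String = hint match {
    case CacheHint.NONE            => "NONE"
    case CacheHint.MEMORY_ONLY     => "MEMORY_ONLY"
    case CacheHint.MEMORY_AND_DISK => "MEMORY_AND_DISK"
  }
}
```

An H2O or Stratosphere backend would map the same hints onto whatever caching semantics it has, or ignore them if materialization is implicit.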


>
>> Firstly, I don't think I am presenting some new revelation - I'm sure
>> you already know that the logical vs. physical "split" in the DSL is not
>> absolute (yet).
>>
>
> Aha. Exactly
>
>
>>
>> That being said, I would like to understand if there are plans, or
>> efforts already underway, to make the DSL (i.e., how DSSVD would be
>> written) and the logical layer (i.e., the drm.plan.* optimizer etc.)
>> more "pure", and move the Spark-specific code entirely into the physical
>> domain. I recall Dmitriy mentioning that a new engine other than Spark
>> was also being planned, therefore I deduce some thought has already been
>> given to such "purification".
>>
>
> Aha. The hope is for Stratosphere. But there are a few items that need to
> be done by the Stratosphere folks before we can leverage it fully. Or,
> let's say, leverage it much better than we otherwise could. It makes
> sense to wait a bit.
>
>
>>
>> It would be nice to see changes approximately like:
>>
>> Rename ./spark => ./dsl
>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>> ./dsl/src/main/scala/org/apache/mahout/dsl
>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>> ./dsl/src/main/scala/org/apache/mahout/dsl/spark-backend
>>
>
> I was thinking along the lines of factoring the public traits and logical
> operators (DrmLike etc.) out of the spark module into an independent
> module without particular engine dependencies. Exactly. It just hasn't
> come to that yet.
>
>
>> along with appropriately renaming packages and imports, and confining
>> references to RDD and SparkContext completely within spark-backend.
>>
>> I think such a clean split would be necessary to introduce more backend
>> engines. If no efforts are already underway, I would be glad to take on the
>> DSL "purification" task.
>>
>
> I think you got very close to my thinking about further steps here. Like
> I said, I was just idling in wait for something like Stratosphere to come
> closer to our orbit.
>

OK, I think there is reasonable alignment on the goal. But it was not clear
whether you are going to be doing the purification split in the near
future, or whether that is still an "unassigned task" which I can pick up?
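For concreteness, the factored-out split could look roughly like the following sketch. Only DrmLike/CheckpointedDrm echo real DSL names; `InMemoryDrm` and everything else here are illustrative assumptions, not proposed code:

```scala
// --- engine-independent module (the "dsl" part) ---
// Logical traits only; no RDD, SparkContext, or other engine types leak out.
trait DrmLike[K] {
  def nrow: Long
  def ncol: Int
  def checkpoint(): CheckpointedDrm[K]
}

// Note: no `rdd` property here, unlike today's Spark-coupled version.
trait CheckpointedDrm[K] extends DrmLike[K]

// --- per-engine physical module (spark-backend, h2o-backend, ...) ---
// A trivial in-memory stand-in for a real backend's materialized matrix.
class InMemoryDrm[K](rows: Map[K, Vector[Double]], val ncol: Int)
    extends CheckpointedDrm[K] {
  def nrow: Long = rows.size.toLong
  def checkpoint(): CheckpointedDrm[K] = this // already materialized
  def row(k: K): Vector[Double] = rows(k)
}
```

Algorithms like DSSVD would then depend only on the first module, and each backend would ship its own physical types and optimizer behind these traits.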

Avati
