PS. And I am not in favor of "dsl" as a name. Too loose.
i'd say "spark" stays "spark" and the abstract bindings would probably go into "math-scala". no reason to create yet another module IMO. On Sun, Apr 27, 2014 at 8:10 PM, Dmitriy Lyubimov <[email protected]> wrote: > > > > > > > On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote: > >> Hi Ted, Dmitry, >> Background: I am exploring the feasibility of providing H2O distributed >> "backend" to the DSL. >> > > Very cool. that's actually was one of my initial proposals on how to > approach this. Got pushed back on this though. > > >> At a high level it appears that implementing physical operators for >> DrmLike over H2O does not seem extremely challenging. All the operators in >> the DSL seem to have at least an approximate equivalent in H2O's own >> (R-like) DSL, and wiring one operator with another's implementation seems >> like a tractable problem. >> > > It should be tractable, sure, even for map reduce. The question is whether > there's enough diversity to do certain optimizations in a certain way. E.g. > if two matrices are identically partitioned, then do map-side zip instead > of actual parallel join etc. > > But it should be tractable, indeed. > > >> The reason I write, is to better understand the split between the Mahout >> DSL and Spark (both current and future). As of today, the DSL seems to be >> pretty tightly coupled with Spark. >> >> E.g: >> >> - DSSVD.scala imports o.a.spark.storage.StorageLevel >> > > This is a known thing, I think i noted it somewhere in jira. That, and rdd > property of CheckpointedDRM. This needs to be abstracted away. > > >> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is >> DrmRddInput (instead of, say, DrmLike) >> > > CheckpointAction is part of physical layer. This is something that would > have to be completely re-written for a new engine. This is the "plugin" > api, but it is never user-facing (logical plan facing). > > >> Firstly, I don't think I am presenting some new revelation you guys don't >> already know - I'm sure you know that the logical vs physical "split" in >> the DSL is not absolute (yet). >> > > Aha. Exactly > > >> >> That being said, I would like to understand if there are plans, or >> efforts already underway to make the DSL (i.e how DSSVD would be written) >> and the logical layer (i.e drm.plan.* optimizer etc) more "pure" and move >> the Spark specific code entirely into the physical domain. I recall Dmitry >> mentioning that a new engine other than Spark was also being planned, >> therefore I deduce some thought for such "purification" has already been >> applied. >> > > Aha. The hope is for Stratosphere. But there are few items that need to be > done by Stratosphere folks before we can leverage it fully. Or, let's say, > leverage it much better than we otherwise could. Makes sense to wait a bit. > > >> >> It would be nice to see changes approximately like: >> >> Rename ./spark => ./dsl >> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings => >> ./dsl/src/main/scala/org/apache/mahout/dsl >> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas => >> ./dsl/main/scala/org/apache/mahout/dsl/spark-backend >> > > i was thinking along the lines factoring out public traits and logical > operators (DRMLike etc.) out of spark module into independent module > without particular engine dependencies. Exactly. It just hasn't come to > that yet. 
>> along with appropriately renaming packages and imports, and confining
>> references to RDD and SparkContext completely within spark-backend.
>>
>> I think such a clean split would be necessary to introduce more backend
>> engines. If no efforts are already underway, I would be glad to take on
>> the DSL "purification" task.
>
> I think you got very close to my thinking about the further steps here.
> Like I said, I was just idling in wait for something like Stratosphere
> to come closer to our orbit.
>
>> Thanks,
>> Avati
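For reference, the identical-partitioning optimization mentioned earlier
in the thread could look roughly like the following in Spark terms. This
is a sketch with hypothetical names (DrmRdd, elementwise), not Mahout's
actual blas code; it assumes both RDDs share a partitioner and the same
per-partition row order, which holds when both derive from the same
checkpoint.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.mahout.math.Vector

    object ZipVsJoin {
      // One row of a distributed row matrix: (row key, row vector).
      type DrmRdd = RDD[(Int, Vector)]

      // Elementwise operation over two matrices. If both sides share a
      // partitioner (and per-partition row order), zip partitions
      // map-side and skip the shuffle; otherwise fall back to a join.
      def elementwise(a: DrmRdd, b: DrmRdd)
                     (op: (Vector, Vector) => Vector): DrmRdd =
        if (a.partitioner.isDefined && a.partitioner == b.partitioner)
          a.zipPartitions(b) { (ia, ib) =>
            ia.zip(ib).map { case ((k, va), (_, vb)) => k -> op(va, vb) }
          }
        else
          a.join(b).map { case (k, (va, vb)) => k -> op(va, vb) }
    }

The payoff is that the co-partitioned case runs entirely map-side, while
the general case still degrades gracefully to a shuffle-based join.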
