PS. And I am not in favor of "dsl" as a name. Too loose.
i'd say "spark" stays "spark" and the abstract bindings would probably go into "math-scala". no reason to create yet another module IMO. On Sun, Apr 27, 2014 at 8:10 PM, Dmitriy Lyubimov <[email protected]> wrote: > > > > > > > On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote: > >> Hi Ted, Dmitry, >> Background: I am exploring the feasibility of providing H2O distributed >> "backend" to the DSL. >> > > Very cool. that's actually was one of my initial proposals on how to > approach this. Got pushed back on this though. > > >> At a high level it appears that implementing physical operators for >> DrmLike over H2O does not seem extremely challenging. All the operators in >> the DSL seem to have at least an approximate equivalent in H2O's own >> (R-like) DSL, and wiring one operator with another's implementation seems >> like a tractable problem. >> > > It should be tractable, sure, even for map reduce. The question is whether > there's enough diversity to do certain optimizations in a certain way. E.g. > if two matrices are identically partitioned, then do map-side zip instead > of actual parallel join etc. > > But it should be tractable, indeed. > > >> The reason I write, is to better understand the split between the Mahout >> DSL and Spark (both current and future). As of today, the DSL seems to be >> pretty tightly coupled with Spark. >> >> E.g: >> >> - DSSVD.scala imports o.a.spark.storage.StorageLevel >> > > This is a known thing, I think i noted it somewhere in jira. That, and rdd > property of CheckpointedDRM. This needs to be abstracted away. > > >> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is >> DrmRddInput (instead of, say, DrmLike) >> > > CheckpointAction is part of physical layer. This is something that would > have to be completely re-written for a new engine. This is the "plugin" > api, but it is never user-facing (logical plan facing). > > >> Firstly, I don't think I am presenting some new revelation you guys don't >> already know - I'm sure you know that the logical vs physical "split" in >> the DSL is not absolute (yet). >> > > Aha. Exactly > > >> >> That being said, I would like to understand if there are plans, or >> efforts already underway to make the DSL (i.e how DSSVD would be written) >> and the logical layer (i.e drm.plan.* optimizer etc) more "pure" and move >> the Spark specific code entirely into the physical domain. I recall Dmitry >> mentioning that a new engine other than Spark was also being planned, >> therefore I deduce some thought for such "purification" has already been >> applied. >> > > Aha. The hope is for Stratosphere. But there are few items that need to be > done by Stratosphere folks before we can leverage it fully. Or, let's say, > leverage it much better than we otherwise could. Makes sense to wait a bit. > > >> >> It would be nice to see changes approximately like: >> >> Rename ./spark => ./dsl >> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings => >> ./dsl/src/main/scala/org/apache/mahout/dsl >> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas => >> ./dsl/main/scala/org/apache/mahout/dsl/spark-backend >> > > i was thinking along the lines factoring out public traits and logical > operators (DRMLike etc.) out of spark module into independent module > without particular engine dependencies. Exactly. It just hasn't come to > that yet. 
>> along with appropriately renaming packages and imports, and confining
>> references to RDD and SparkContext completely within spark-backend.
>>
>> I think such a clean split would be necessary to introduce more backend
>> engines. If no efforts are already underway, I would be glad to take on
>> the DSL "purification" task.
>
> I think you got very close to my thinking about the further steps here.
> Like I said, I was just idling in wait for something like Stratosphere
> to come closer to our orbit.
>
>> Thanks,
>> Avati
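For reference, the identical-partitioning optimization mentioned earlier
in the thread could look roughly like the following in Spark terms. This
is a sketch with hypothetical names (DrmRdd, elementwise), not Mahout's
actual blas code; it assumes both RDDs share a partitioner and the same
per-partition row order, which holds when both derive from the same
checkpoint.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.mahout.math.Vector

    object ZipVsJoin {
      // One row of a distributed row matrix: (row key, row vector).
      type DrmRdd = RDD[(Int, Vector)]

      // Elementwise operation over two matrices. If both sides share a
      // partitioner (and per-partition row order), zip partitions
      // map-side and skip the shuffle; otherwise fall back to a join.
      def elementwise(a: DrmRdd, b: DrmRdd)
                     (op: (Vector, Vector) => Vector): DrmRdd =
        if (a.partitioner.isDefined && a.partitioner == b.partitioner)
          a.zipPartitions(b) { (ia, ib) =>
            ia.zip(ib).map { case ((k, va), (_, vb)) => k -> op(va, vb) }
          }
        else
          a.join(b).map { case (k, (va, vb)) => k -> op(va, vb) }
    }

The payoff is that the co-partitioned case runs entirely map-side, while
the general case still degrades gracefully to a shuffle-based join.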
