On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote:
> Hi Ted, Dmitry,
>
> Background: I am exploring the feasibility of providing H2O distributed
> "backend" to the DSL.

Very cool. That was actually one of my initial proposals on how to
approach this. Got pushed back on it, though.

> At a high level it appears that implementing physical operators for
> DrmLike over H2O does not seem extremely challenging. All the operators in
> the DSL seem to have at least an approximate equivalent in H2O's own
> (R-like) DSL, and wiring one operator with another's implementation seems
> like a tractable problem.

It should be tractable, sure, even for map reduce. The question is whether
there's enough diversity to do certain optimizations in a certain way.
E.g. if two matrices are identically partitioned, then do a map-side zip
instead of an actual parallel join, etc. (see the second sketch at the end
of this mail). But it should be tractable, indeed.

> The reason I write is to better understand the split between the Mahout
> DSL and Spark (both current and future). As of today, the DSL seems to be
> pretty tightly coupled with Spark.
>
> E.g:
>
> - DSSVD.scala imports o.a.spark.storage.StorageLevel

This is a known thing; I think I noted it somewhere in JIRA. That, and the
rdd property of CheckpointedDRM. This needs to be abstracted away.

> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
> DrmRddInput (instead of, say, DrmLike)

CheckpointAction is part of the physical layer. This is something that
would have to be completely re-written for a new engine. This is the
"plugin" api, but it is never user-facing (logical plan facing).

> Firstly, I don't think I am presenting some new revelation you guys don't
> already know - I'm sure you know that the logical vs physical "split" in
> the DSL is not absolute (yet).

Aha. Exactly.

> That being said, I would like to understand if there are plans, or efforts
> already underway, to make the DSL (i.e. how DSSVD would be written) and
> the logical layer (i.e. drm.plan.* optimizer etc.) more "pure" and move
> the Spark-specific code entirely into the physical domain. I recall Dmitry
> mentioning that a new engine other than Spark was also being planned,
> therefore I deduce some thought for such "purification" has already been
> applied.

Aha. The hope is for Stratosphere. But there are a few items that need to
be done by the Stratosphere folks before we can leverage it fully. Or,
let's say, leverage it much better than we otherwise could. Makes sense to
wait a bit.

> It would be nice to see changes approximately like:
>
> Rename ./spark => ./dsl
> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
> ./dsl/src/main/scala/org/apache/mahout/dsl
> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
> ./dsl/main/scala/org/apache/mahout/dsl/spark-backend

I was thinking along the lines of factoring the public traits and logical
operators (DrmLike etc.) out of the spark module into an independent
module without particular engine dependencies (roughly as sketched below).
Exactly. It just hasn't come to that yet.

> along with appropriately renaming packages and imports, and confining
> references to RDD and SparkContext completely within spark-backend.
>
> I think such a clean split would be necessary to introduce more backend
> engines. If no efforts are already underway, I would be glad to take on
> the DSL "purification" task.

I think you got very close to my thinking about further steps here. Like I
said, I was just idling in wait for something like Stratosphere to become
closer to our orbit.

> Thanks,
> Avati

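And on the identical-partitioning point above, a minimal Spark-side
sketch of the map-side zip vs. parallel join idea. The helper is
hypothetical and the matrices are just RDDs of (rowKey, row) for
illustration; the assumption is that when two RDDs share a partitioner,
zipPartitions can combine them without the shuffle a join would incur.

```scala
// Minimal sketch: map-side zip when co-partitioned, join otherwise.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def elementwiseSum(a: RDD[(Int, Array[Double])],
                   b: RDD[(Int, Array[Double])]): RDD[(Int, Array[Double])] =
  if (a.partitioner.isDefined && a.partitioner == b.partitioner) {
    // Identically partitioned: zip partitions in place, no shuffle.
    a.zipPartitions(b) { (ia, ib) =>
      // Assumes identical row ordering within each partition -- holds
      // when both sides came from the same upstream partitioning step.
      ia.zip(ib).map { case ((k, va), (_, vb)) =>
        (k, va.zip(vb).map { case (x, y) => x + y })
      }
    }
  } else {
    // General case: an actual parallel join (shuffles both sides).
    a.join(b).mapValues { case (va, vb) =>
      va.zip(vb).map { case (x, y) => x + y }
    }
  }
```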