Yes, it is just one more variable to abstract away, to wrap into something like MahoutContext.
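Roughly along these lines, perhaps. This is only a sketch of the idea; apart from Spark's own SparkContext, none of these names exist in the code yet, so treat them as purely illustrative:

    import org.apache.spark.SparkContext

    // Purely illustrative: neither MahoutContext nor SparkMahoutContext
    // exists in Mahout today.
    trait MahoutContext {
      // An example of an engine-neutral query the DSL might need.
      def defaultParallelism: Int
    }

    // The Spark backend would just wrap a live SparkContext.
    class SparkMahoutContext(val sc: SparkContext) extends MahoutContext {
      def defaultParallelism: Int = sc.defaultParallelism
    }

    // DSL methods would then take the abstract context implicitly,
    // instead of an implicit SparkContext, so e.g. an H2O backend can
    // supply its own implementation.
    object Example {
      def someAlgorithm(k: Int)(implicit ctx: MahoutContext): Unit = {
        require(k > 0 && k <= ctx.defaultParallelism * 1000)
      }
    }

The point being that algorithm code only ever sees the trait, and the engine choice becomes a runtime concern.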
On Mon, Apr 28, 2014 at 2:43 PM, Anand Avati <[email protected]> wrote:
> Sebastian,
> I'm still not sure how big or small a problem the implicit val is, but I
> will keep the point in the back of my mind as I explore further. I agree
> that the flexibility of backends is a powerful feature making Mahout
> unique and attractive indeed. I hope to see that flexibility exercised
> purely at runtime.
>
> On Sun, Apr 27, 2014 at 11:11 PM, Sebastian Schelter <[email protected]>
> wrote:
>
>> Anand,
>>
>> I'd also love to see work on a cleaner separation between the DSL and
>> Spark. Another thing that should be tackled in the current code is that
>> the SparkContext has to be present as an implicit val in some methods.
>>
>> Making the DSL run on different systems will be a powerful feature that
>> will make Mahout unique and attractive to a lot of users, as it doesn't
>> enforce lock-in to a particular system. I recently talked to a company
>> that had exactly this requirement: they decided against using Spark,
>> but would still be highly interested in running new Mahout recommenders
>> built using the DSL.
>>
>> --sebastian
>>
>> On 04/28/2014 05:39 AM, Anand Avati wrote:
>>
>>> On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <[email protected]>
>>> wrote:
>>>
>>>> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote:
>>>>
>>>>> Hi Ted, Dmitry,
>>>>> Background: I am exploring the feasibility of providing an H2O
>>>>> distributed "backend" to the DSL.
>>>>
>>>> Very cool. That was actually one of my initial proposals on how to
>>>> approach this. Got pushed back on it, though.
>>>
>>> We are exploring various means of integration. The Jira mentioned
>>> providing Matrix and Vector implementations as an initial exploration.
>>> That task by itself had a lot of value in terms of reconciling some
>>> ground-level issues (build/mvn compatibility, highlighting some
>>> classloader-related challenges etc. on the H2O side). Plugging in
>>> behind a common DSL makes sense, though there may be value in other
>>> points of integration too, to exploit H2O's strengths.
>>>
>>>>> At a high level, implementing physical operators for DrmLike over
>>>>> H2O does not seem extremely challenging. All the operators in the
>>>>> DSL seem to have at least an approximate equivalent in H2O's own
>>>>> (R-like) DSL, and wiring one operator to another's implementation
>>>>> seems like a tractable problem.
>>>>
>>>> It should be tractable, sure, even for MapReduce. The question is
>>>> whether there's enough diversity to do certain optimizations in a
>>>> certain way. E.g., if two matrices are identically partitioned, then
>>>> do a map-side zip instead of an actual parallel join, etc.
>>>>
>>>> But it should be tractable, indeed.
>>>
>>> Yes, H2O has ways to do such things - a single map/reduce task over
>>> two similarly partitioned matrices "side by side" (i.e., sharing the
>>> same VectorGroup, in H2O terminology).
>>>
>>>>> The reason I write is to better understand the split between the
>>>>> Mahout DSL and Spark (both current and future). As of today, the
>>>>> DSL seems to be pretty tightly coupled with Spark.
>>>>>
>>>>> E.g.:
>>>>>
>>>>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>>>
>>>> This is a known thing; I think I noted it somewhere in Jira. That,
>>>> and the rdd property of CheckpointedDRM. This needs to be abstracted
>>>> away.
>>>>
>>>>> - drm.plan.CheckpointAction: the result of exec() and checkpoint()
>>>>>   is DrmRddInput (instead of, say, DrmLike)
>>>>
>>>> CheckpointAction is part of the physical layer. This is something
>>>> that would have to be completely re-written for a new engine. This
>>>> is the "plugin" api, but it is never user-facing (logical plan
>>>> facing).
>>>
>>> It somehow felt that the optimizer was logical-ish. Do you mean the
>>> optimizations in CheckpointAction are specific to Spark and cannot be
>>> inherited in general by other backends (not that I think that is
>>> wrong)?
>>>
>>>>> Firstly, I don't think I am presenting some new revelation you guys
>>>>> don't already know - I'm sure you know that the logical vs. physical
>>>>> "split" in the DSL is not absolute (yet).
>>>>
>>>> Aha. Exactly.
>>>>
>>>>> That being said, I would like to understand whether there are plans,
>>>>> or efforts already underway, to make the DSL (i.e., how DSSVD would
>>>>> be written) and the logical layer (i.e., the drm.plan.* optimizer
>>>>> etc.) more "pure" and move the Spark-specific code entirely into the
>>>>> physical domain. I recall Dmitry mentioning that a new engine other
>>>>> than Spark was also being planned, so I deduce some thought has
>>>>> already been given to such "purification".
>>>>
>>>> Aha. The hope is for Stratosphere. But there are a few items that
>>>> need to be done by the Stratosphere folks before we can leverage it
>>>> fully - or, let's say, leverage it much better than we otherwise
>>>> could. Makes sense to wait a bit.
>>>>
>>>>> It would be nice to see changes approximately like:
>>>>>
>>>>> Rename ./spark => ./dsl
>>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>>>>>   ./dsl/src/main/scala/org/apache/mahout/dsl
>>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>>>>>   ./dsl/main/scala/org/apache/mahout/dsl/spark-backend
>>>>
>>>> I was thinking along the lines of factoring the public traits and
>>>> logical operators (DrmLike etc.) out of the spark module into an
>>>> independent module without particular engine dependencies. Exactly.
>>>> It just hasn't come to that yet.
>>>>
>>>>> along with appropriately renaming packages and imports, and
>>>>> confining references to RDD and SparkContext completely within
>>>>> spark-backend.
>>>>>
>>>>> I think such a clean split would be necessary to introduce more
>>>>> backend engines. If no efforts are already underway, I would be
>>>>> glad to take on the DSL "purification" task.
>>>>
>>>> I think you got very close to my thinking about further steps here.
>>>> Like I said, I was just idling in wait for something like
>>>> Stratosphere to come closer to our orbit.
>>>
>>> OK, I think there is reasonable alignment on the goal. But you were
>>> not clear on whether you are going to be doing the purification split
>>> in the near future, or is that still an "unassigned task" which I can
>>> pick up?
>>>
>>> Avati
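P.S. To make the factoring idea a bit more concrete, here is a rough sketch of what such an engine-neutral module might contain. The package layout and trait members below are illustrative only (DrmLike is the existing trait, the rest is made up), not a commitment to any particular design:

    // Sketch of an engine-free logical layer; names are illustrative.
    package org.apache.mahout.dsl {

      // The logical distributed row matrix: no RDD, no SparkContext.
      trait DrmLike[K] {
        def nrow: Long
        def ncol: Int
      }

      // Contract each backend (Spark, H2O, Stratosphere, ...) fulfills.
      trait DistributedEngine {
        // Run the optimizer over a logical plan and materialize it.
        def checkpoint[K](plan: DrmLike[K]): DrmLike[K]

        // Engine-specific I/O entry point.
        def drmFromHdfs(path: String): DrmLike[Int]
      }
    }

Then things like CheckpointAction and DrmRddInput would live entirely inside the Spark backend module, behind the DistributedEngine boundary.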
