Sebastian, I'm still not sure how big or small a problem the implicit val is, but I will keep the point in the back of my mind as I explore further. I agree that the flexibility of backends is a powerful feature that makes Mahout unique and attractive indeed. I hope to see that flexibility exercised purely at runtime.
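
To make sure I understand the shape of the problem, here is roughly the signature change I picture. This is only a sketch; DistributedContext and drmFromHdfs are hypothetical names, not current Mahout API.

import org.apache.spark.SparkContext

trait DrmLike[K]

// Today's shape: the Spark-specific context leaks into the user-facing
// signature, so callers need a SparkContext in implicit scope even for
// otherwise engine-neutral code.
object SparkCoupled {
  def drmFromHdfs(path: String)(implicit sc: SparkContext): DrmLike[Int] = ???
}

// A possible engine-neutral shape: an abstract context that each backend
// (Spark, H2O, ...) would implement, resolved purely at runtime.
trait DistributedContext

object EngineNeutral {
  def drmFromHdfs(path: String)(implicit ctx: DistributedContext): DrmLike[Int] = ???
}
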
On Sun, Apr 27, 2014 at 11:11 PM, Sebastian Schelter <[email protected]> wrote:

> Anand,
>
> I'd also love to see work on a cleaner separation between the DSL and
> Spark. Another thing that should be tackled in the current code is that
> the SparkContext has to be present as an implicit val in some methods.
>
> Making the DSL run on different systems will be a powerful feature that
> will make Mahout unique and attractive to a lot of users, as it doesn't
> enforce a lock-in to a particular system. I've talked to a company
> recently that had exactly this requirement; they decided against using
> Spark, but would still be highly interested in running new Mahout
> recommenders built using the DSL.
>
> --sebastian
>
> On 04/28/2014 05:39 AM, Anand Avati wrote:
>
>> On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <[email protected]>
>> wrote:
>>
>>> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote:
>>>
>>>> Hi Ted, Dmitry,
>>>> Background: I am exploring the feasibility of providing an H2O
>>>> distributed "backend" to the DSL.
>>>
>>> Very cool. That actually was one of my initial proposals on how to
>>> approach this. Got pushed back on it, though.
>>
>> We are exploring various means of integration. The JIRA mentioned
>> providing Matrix and Vector implementations as an initial exploration.
>> That task by itself had a lot of value in terms of reconciling some
>> ground-level issues (build/mvn compatibility, highlighting some
>> classloader-related challenges on the H2O side, etc.). Plugging in
>> behind a common DSL makes sense, though there may be value in other
>> points of integration too, to exploit H2O's strengths.
>>
>>>> At a high level, implementing physical operators for DrmLike over H2O
>>>> does not seem extremely challenging. All the operators in the DSL seem
>>>> to have at least an approximate equivalent in H2O's own (R-like) DSL,
>>>> and wiring one operator to another's implementation seems like a
>>>> tractable problem.
>>>
>>> It should be tractable, sure, even for MapReduce. The question is
>>> whether there's enough diversity to do certain optimizations in a
>>> certain way. E.g., if two matrices are identically partitioned, do a
>>> map-side zip instead of an actual parallel join, etc.
>>>
>>> But it should be tractable, indeed.
>>
>> Yes, H2O has ways to do such things: a single map/reduce task over two
>> similarly partitioned matrices "side by side" (i.e., sharing the same
>> VectorGroup, in H2O terminology).
>>
>>>> The reason I write is to better understand the split between the
>>>> Mahout DSL and Spark (both current and future). As of today, the DSL
>>>> seems to be pretty tightly coupled with Spark. E.g.:
>>>>
>>>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>>
>>> This is a known thing; I think I noted it somewhere in JIRA. That, and
>>> the rdd property of CheckpointedDRM. This needs to be abstracted away.
>>>
>>>> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
>>>> DrmRddInput (instead of, say, DrmLike)
>>>
>>> CheckpointAction is part of the physical layer. This is something that
>>> would have to be completely rewritten for a new engine. This is the
>>> "plugin" API, but it is never user-facing (logical-plan-facing).
>>
>> It somehow felt that the optimizer was logical-ish. Do you mean the
>> optimizations in CheckpointAction are specific to Spark and cannot be
>> inherited in general by other backends (not that I think that would be
>> wrong)?
>>
>>>> Firstly, I don't think I am presenting some new revelation you guys
>>>> don't already know. I'm sure you know that the logical vs. physical
>>>> "split" in the DSL is not absolute (yet).
>>>
>>> Aha. Exactly.
>>>
>>>> That being said, I would like to understand if there are plans, or
>>>> efforts already underway, to make the DSL (i.e., how DSSVD would be
>>>> written) and the logical layer (i.e., the drm.plan.* optimizer, etc.)
>>>> more "pure" and move the Spark-specific code entirely into the
>>>> physical domain. I recall Dmitry mentioning that a new engine other
>>>> than Spark was also being planned; therefore I deduce that some
>>>> thought has already been applied to such "purification".
>>>
>>> Aha. The hope is for Stratosphere. But there are a few items that need
>>> to be done by the Stratosphere folks before we can leverage it fully.
>>> Or, let's say, leverage it much better than we otherwise could. Makes
>>> sense to wait a bit.
>>>
>>>> It would be nice to see changes approximately like:
>>>>
>>>> Rename ./spark => ./dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>>>> ./dsl/src/main/scala/org/apache/mahout/dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>>>> ./dsl/src/main/scala/org/apache/mahout/dsl/spark-backend
>>>
>>> I was thinking along the lines of factoring the public traits and
>>> logical operators (DrmLike, etc.) out of the spark module into an
>>> independent module without particular engine dependencies. Exactly.
>>> It just hasn't come to that yet.
>>>
>>>> along with appropriately renaming packages and imports, and confining
>>>> references to RDD and SparkContext completely within spark-backend.
>>>>
>>>> I think such a clean split would be necessary to introduce more
>>>> backend engines. If no efforts are already underway, I would be glad
>>>> to take on the DSL "purification" task.
>>>
>>> I think you got very close to my thinking about further steps here.
>>> Like I said, I was just idling in wait for something like Stratosphere
>>> to come closer to our orbit.
>>
>> OK, I think there is reasonable alignment on the goal. But you were not
>> clear on whether you are going to be doing the purification split in
>> the near future, or is that still an "unassigned task" which I can
>> pick up?
>>
>> Avati
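
To make the "purification" a bit more concrete, here is roughly the shape I have in mind: the dsl module would expose the logical types plus a small engine trait, and each backend module would implement that trait. A sketch only; all names below are hypothetical, not existing Mahout code.

// Engine-neutral logical type, living in the proposed dsl module.
trait DrmLike[K]

// Hypothetical backend contract. DSSVD and the drm.plan.* optimizer would
// compile against this trait alone, with no Spark or H2O imports; the
// spark-backend (and an eventual h2o-backend) would each supply an instance.
trait DistributedEngine {
  // Materialize an optimized logical plan on this engine, returning an
  // engine-neutral handle rather than something like DrmRddInput.
  def checkpoint[K](plan: DrmLike[K]): DrmLike[K]

  // Engine-specific I/O, confining SparkContext/RDD references to the backend.
  def drmFromHdfs(path: String): DrmLike[Int]
}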

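And on Dmitriy's point about identically partitioned matrices: in Spark terms I read that as something like the sketch below (illustrative only; it assumes matching row order within co-located partitions, and the helper name elementwiseSum is made up).

import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 0.9.x era)
import org.apache.spark.rdd.RDD

// Element-wise sum of two row-keyed matrices. If both RDDs share a
// partitioner, matching rows are already co-located, so we can walk the
// partitions side by side instead of paying for a shuffling join.
def elementwiseSum(a: RDD[(Int, Array[Double])],
                   b: RDD[(Int, Array[Double])]): RDD[(Int, Array[Double])] =
  if (a.partitioner.isDefined && a.partitioner == b.partitioner)
    // Map-side zip: no shuffle. Assumes identical row order within each
    // pair of co-located partitions.
    a.zipPartitions(b) { (ia, ib) =>
      ia.zip(ib).map { case ((k, ra), (_, rb)) =>
        (k, ra.zip(rb).map { case (x, y) => x + y })
      }
    }
  else
    // General case: an actual (shuffling) parallel join.
    a.join(b).mapValues { case (ra, rb) =>
      ra.zip(rb).map { case (x, y) => x + y }
    }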