On Mon, Apr 28, 2014 at 9:48 AM, Dmitriy Lyubimov <[email protected]> wrote:
> On Sun, Apr 27, 2014 at 8:39 PM, Anand Avati <[email protected]> wrote:
>> On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote:
>>>> Hi Ted, Dmitry,
>>>> Background: I am exploring the feasibility of providing an H2O distributed
>>>> "backend" to the DSL.
>>
>> Yes, H2O has ways to do such things - a single map/reduce task on two
>> matrices "side by side" which are similarly partitioned (i.e., sharing the
>> same VectorGroup in H2O terminology).
>
> OK. Another question I had was about internal data representation.
> First, the distributed architecture assumes the engines are agnostic of the
> type of payload as long as external serialization is provided. The way it
> has been explained so far, H2O is tightly bound to a particular data
> representation in the back end.
> Second, what we do here in Mahout is assume the back end can make data
> available in the form of vertical Matrix blocks to user closures running in
> the backend.

H2O's natural orientation is column-optimized, so the user closures running
in the backend would encounter horizontal Matrix blocks.

> Again, it was repeatedly explained that H2O has no matrix representation
> for backend things.

H2O has a strong 2-D "Frame". The Matrix abstraction over it is what was
built in github.com/tdunning/h2o-matrix, which mostly provides Matrix-ish
sounding names for functionality that already existed on an H2O Frame.

> So it looks like we can neither plug in mahout-math as the backend
> blockwise matrix representation, nor do we have access to an alternative
> Matrix-based vertical blocking. How is that to be resolved, in your opinion?

We can trivially provide (sub-)Matrix access with horizontal blocking in
H2O's mapreduce() - i.e., the mapper method in H2O's map/reduce API gets
access to a batch of rows, local to the compute node, one batch per mapper
call.
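To make the orientation question concrete, here is a minimal, self-contained sketch of the two ideas in play: slicing a matrix into horizontal (row-batch) blocks, the way a per-rowbatch mapper would see them, and the transpose that could transparently reconcile a row-oriented view with a column-oriented one. Plain Java 2-D arrays stand in for Mahout's Matrix and H2O's Frame here; `Blocking`, `horizontalBlocks`, and `batchRows` are hypothetical illustration names, not actual Mahout or H2O API.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: horizontal (row-batch) blocking, plus the transpose that would
 *  reconcile it with a column-oriented ("vertical") view. Plain 2-D double
 *  arrays stand in for Mahout's Matrix / H2O's Frame. */
public class Blocking {

    /** Split a matrix into horizontal blocks of at most batchRows rows each,
     *  roughly the way a per-rowbatch mapper call would see the data. */
    static List<double[][]> horizontalBlocks(double[][] m, int batchRows) {
        List<double[][]> blocks = new ArrayList<>();
        for (int start = 0; start < m.length; start += batchRows) {
            int rows = Math.min(batchRows, m.length - start);
            double[][] block = new double[rows][];
            for (int i = 0; i < rows; i++) block[i] = m[start + i];
            blocks.add(block);
        }
        return blocks;
    }

    /** Transparent transpose: a horizontal block of m is a vertical block of
     *  m-transposed, which is the reconciliation idea mentioned above. */
    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                t[j][i] = m[i][j];
        return t;
    }
}
```

The point of the sketch is only that the two orientations are interconvertible at a well-defined layer, so the user-facing closure signature need not know which one the engine used.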
This is almost natural to H2O. The per-row mapper API in H2OMatrix is a
wrapper around the per-rowbatch internal API. And I think horizontal vs
vertical is an arbitrary choice and a reconcilable problem (transparently
transposing the matrix in the H2OMatrix layer).

>>>> The reason I write is to better understand the split between the Mahout
>>>> DSL and Spark (both current and future). As of today, the DSL seems to
>>>> be pretty tightly coupled with Spark.
>>>>
>>>> E.g.:
>>>>
>>>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>>
>>> This is a known thing; I think I noted it somewhere in JIRA. That, and
>>> the rdd property of CheckpointedDRM. This needs to be abstracted away.
>>>
>>>> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
>>>> DrmRddInput (instead of, say, DrmLike)
>>>
>>> CheckpointAction is part of the physical layer. This is something that
>>> would have to be completely re-written for a new engine. It is the
>>> "plugin" API, but it is never user-facing (logical plan facing).
>>
>> It somehow felt that the optimizer was logical-ish. Do you mean the
>> optimizations in CheckpointAction are specific to Spark and cannot be
>> inherited in general by other backends (not that I think that is wrong)?
>
> Well, there are three things: the logical plan (the operator DAG), the
> physical DAG, and the optimizer rewrite & cost logic that transforms
> logical into physical.
>
> The logical DAG is user-facing and the top level. However, IMO the logic
> that rewrites the logical DAG into the physical DAG should be
> engine-specific, in order to capitalize on engine-specific capabilities.
> It would probably share a lot of commonalities (e.g., we could maintain a
> common pool of physical operators, assuming some commonalities between
> physical engine implementations), but the cost-rewriting part should still
> be specific, even if it ends up very similar to an existing one.
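The three-layer split described above (an engine-agnostic logical DAG, per-engine physical plans, and a per-engine rewriter between them) can be sketched roughly as below. All of the names (`LogicalOp`, `PhysicalPlanner`, `SparkPlanner`, `H2OPlanner`) are hypothetical illustrations, not the actual Mahout classes.

```java
/** Sketch of the three layers: a user-facing logical operator DAG, plus one
 *  planner per engine that rewrites the logical DAG into an engine-specific
 *  physical plan. All names are illustrative, not actual Mahout classes. */
public class PlanLayers {

    // Logical layer: user-facing, engine-agnostic operator DAG.
    interface LogicalOp { String describe(); }

    static class Source implements LogicalOp {
        final String name;
        Source(String name) { this.name = name; }
        public String describe() { return name; }
    }

    static class Transpose implements LogicalOp {
        final LogicalOp input;
        Transpose(LogicalOp input) { this.input = input; }
        public String describe() { return "t(" + input.describe() + ")"; }
    }

    // Physical layer: one planner per engine; each is free to apply its own
    // cost-based rewrites while sharing the logical operator vocabulary.
    interface PhysicalPlanner { String plan(LogicalOp root); }

    static class SparkPlanner implements PhysicalPlanner {
        public String plan(LogicalOp root) { return "spark-exec(" + root.describe() + ")"; }
    }

    static class H2OPlanner implements PhysicalPlanner {
        // A column-oriented engine might, for instance, elide a transpose by
        // flipping an orientation flag instead of moving data; here we only
        // tag the plan to keep the sketch small.
        public String plan(LogicalOp root) { return "h2o-exec(" + root.describe() + ")"; }
    }
}
```

The design point mirrors the email: the `LogicalOp` vocabulary is the shared, user-facing contract, while each `PhysicalPlanner` owns its engine's cost logic and rewrites.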
> I also want to reserve, exclusively for the Spark optimizer, future work
> that calls upon the advanced dynamic load scheduling techniques that were
> thoroughly investigated in the SystemML project.

>>>> Firstly, I don't think I am presenting some new revelation you guys
>>>> don't already know - I'm sure you know that the logical vs physical
>>>> "split" in the DSL is not absolute (yet).
>>>
>>> Aha. Exactly.
>>>
>>>> That being said, I would like to understand if there are plans, or
>>>> efforts already underway, to make the DSL (i.e., how DSSVD would be
>>>> written) and the logical layer (i.e., the drm.plan.* optimizer etc.)
>>>> more "pure", and move the Spark-specific code entirely into the
>>>> physical domain. I recall Dmitry mentioning that a new engine other
>>>> than Spark was also being planned, so I deduce some thought has
>>>> already gone into such "purification".
>>>
>>> Aha. The hope is for Stratosphere. But there are a few items that need
>>> to be done by the Stratosphere folks before we can leverage it fully -
>>> or, let's say, leverage it much better than we otherwise could. It makes
>>> sense to wait a bit.
>>>
>>>> It would be nice to see changes approximately like:
>>>>
>>>> Rename ./spark => ./dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>>>> ./dsl/src/main/scala/org/apache/mahout/dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>>>> ./dsl/main/scala/org/apache/mahout/dsl/spark-backend
>>>
>>> I was thinking along the lines of factoring the public traits and
>>> logical operators (DrmLike etc.) out of the spark module into an
>>> independent module without particular engine dependencies. Exactly. It
>>> just hasn't come to that yet.
>>>
>>>> along with appropriately renaming packages and imports, and confining
>>>> references to RDD and SparkContext completely within spark-backend.
>>>> I think such a clean split would be necessary to introduce more backend
>>>> engines. If no efforts are already underway, I would be glad to take on
>>>> the DSL "purification" task.
>>>
>>> I think you got very close to my thinking about further steps here. Like
>>> I said, I was just idling in wait for something like Stratosphere to
>>> become closer to our orbit.
>>
>> OK, I think there is reasonable alignment on the goal. But you were not
>> clear on whether you are going to be doing the purification split in the
>> near future, or is that still an "unassigned task" which I can pick up?
>
> Yes, it is unassigned, and frankly I thought I might want to continue
> working on this separation. However, you are welcome to take a stab,
> especially if you see a clear path for implementing the mapBlock()
> operator in H2O, per my questions above, without changing its signatures.

Subject to my understanding that mapBlock() slices a matrix into batches of
columns, I am tempted to believe the signature need not change. I am not
too concerned about the row vs column orientation just yet.

Avati
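The signature-stability question above can be made concrete with a small sketch: a blockwise map whose user closure receives (keys, block) pairs and never learns whether the engine delivered row batches or column batches. The names here (`MapBlockSketch`, `Block`, `mapBlock`) and the use of plain arrays in place of DRM blocks are illustrative assumptions, not Mahout's actual API.

```java
import java.util.function.Function;

/** Sketch of a signature-stable blockwise map: the user closure sees
 *  (keys, values) blocks regardless of how the engine sliced the matrix.
 *  Names and shapes are illustrative, not Mahout's actual mapBlock API. */
public class MapBlockSketch {

    /** One block of the distributed matrix: its keys plus a dense payload. */
    static class Block {
        final int[] keys;        // keys identifying the slices in this block
        final double[][] values; // the block's numeric payload
        Block(int[] keys, double[][] values) {
            this.keys = keys;
            this.values = values;
        }
    }

    /** Apply a user closure to every block. The closure's signature does not
     *  depend on whether blocks were produced by row or column batching, so
     *  an engine could change orientation without changing this contract. */
    static Block[] mapBlock(Block[] blocks, Function<Block, Block> fn) {
        Block[] out = new Block[blocks.length];
        for (int i = 0; i < blocks.length; i++) out[i] = fn.apply(blocks[i]);
        return out;
    }
}
```

A closure that, say, scales every entry of its block would be written once against this contract and run unchanged on either orientation, which is the crux of Dmitriy's "without changing its signatures" condition.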
