Yes, it is just one more variable to abstract away, to wrap into something like MahoutContext.
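Roughly along these lines, perhaps. This is only a sketch of the idea; apart from Spark's own SparkContext, none of these names exist in the code yet, so treat them as purely illustrative:

    import org.apache.spark.SparkContext

    // Purely illustrative: neither MahoutContext nor SparkMahoutContext
    // exists in Mahout today.
    trait MahoutContext {
      // An example of an engine-neutral query the DSL might need.
      def defaultParallelism: Int
    }

    // The Spark backend would just wrap a live SparkContext.
    class SparkMahoutContext(val sc: SparkContext) extends MahoutContext {
      def defaultParallelism: Int = sc.defaultParallelism
    }

    // DSL methods would then take the abstract context implicitly,
    // instead of an implicit SparkContext, so e.g. an H2O backend can
    // supply its own implementation.
    object Example {
      def someAlgorithm(k: Int)(implicit ctx: MahoutContext): Unit = {
        require(k > 0 && k <= ctx.defaultParallelism * 1000)
      }
    }

The point being that algorithm code only ever sees the trait, and the engine choice becomes a runtime concern.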
On Mon, Apr 28, 2014 at 2:43 PM, Anand Avati <[email protected]> wrote:
> Sebastian,
> I'm still not sure how big or small a problem the implicit val is, but I
> will keep the point in the back of my mind as I explore further. I agree
> that the flexibility of backends is a powerful feature making Mahout
> unique and attractive indeed. I hope to see that flexibility exercised
> purely at runtime.
>
> On Sun, Apr 27, 2014 at 11:11 PM, Sebastian Schelter <[email protected]>
> wrote:
>
>> Anand,
>>
>> I'd also love to see work on a cleaner separation between the DSL and
>> Spark. Another thing that should be tackled in the current code is that
>> the SparkContext has to be present as an implicit val in some methods.
>>
>> Making the DSL run on different systems will be a powerful feature that
>> will make Mahout unique and attractive to a lot of users, as it doesn't
>> enforce lock-in to a particular system. I recently talked to a company
>> that had exactly this requirement: they decided against using Spark,
>> but would still be highly interested in running new Mahout recommenders
>> built using the DSL.
>>
>> --sebastian
>>
>> On 04/28/2014 05:39 AM, Anand Avati wrote:
>>
>>> On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <[email protected]>
>>> wrote:
>>>
>>>> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote:
>>>>
>>>>> Hi Ted, Dmitry,
>>>>> Background: I am exploring the feasibility of providing an H2O
>>>>> distributed "backend" to the DSL.
>>>>
>>>> Very cool. That was actually one of my initial proposals on how to
>>>> approach this. Got pushed back on it, though.
>>>
>>> We are exploring various means of integration. The Jira mentioned
>>> providing Matrix and Vector implementations as an initial exploration.
>>> That task by itself had a lot of value in terms of reconciling some
>>> ground-level issues (build/mvn compatibility, highlighting some
>>> classloader-related challenges etc. on the H2O side). Plugging in
>>> behind a common DSL makes sense, though there may be value in other
>>> points of integration too, to exploit H2O's strengths.
>>>
>>>>> At a high level, implementing physical operators for DrmLike over
>>>>> H2O does not seem extremely challenging. All the operators in the
>>>>> DSL seem to have at least an approximate equivalent in H2O's own
>>>>> (R-like) DSL, and wiring one operator to another's implementation
>>>>> seems like a tractable problem.
>>>>
>>>> It should be tractable, sure, even for MapReduce. The question is
>>>> whether there's enough diversity to do certain optimizations in a
>>>> certain way. E.g., if two matrices are identically partitioned, then
>>>> do a map-side zip instead of an actual parallel join, etc.
>>>>
>>>> But it should be tractable, indeed.
>>>
>>> Yes, H2O has ways to do such things - a single map/reduce task over
>>> two similarly partitioned matrices "side by side" (i.e., sharing the
>>> same VectorGroup, in H2O terminology).
>>>
>>>>> The reason I write is to better understand the split between the
>>>>> Mahout DSL and Spark (both current and future). As of today, the
>>>>> DSL seems to be pretty tightly coupled with Spark.
>>>>>
>>>>> E.g.:
>>>>>
>>>>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>>>
>>>> This is a known thing; I think I noted it somewhere in Jira. That,
>>>> and the rdd property of CheckpointedDRM. This needs to be abstracted
>>>> away.
>>>>
>>>>> - drm.plan.CheckpointAction: the result of exec() and checkpoint()
>>>>>   is DrmRddInput (instead of, say, DrmLike)
>>>>
>>>> CheckpointAction is part of the physical layer. This is something
>>>> that would have to be completely re-written for a new engine. This
>>>> is the "plugin" api, but it is never user-facing (logical plan
>>>> facing).
>>>
>>> It somehow felt that the optimizer was logical-ish. Do you mean the
>>> optimizations in CheckpointAction are specific to Spark and cannot be
>>> inherited in general by other backends (not that I think that is
>>> wrong)?
>>>
>>>>> Firstly, I don't think I am presenting some new revelation you guys
>>>>> don't already know - I'm sure you know that the logical vs. physical
>>>>> "split" in the DSL is not absolute (yet).
>>>>
>>>> Aha. Exactly.
>>>>
>>>>> That being said, I would like to understand whether there are plans,
>>>>> or efforts already underway, to make the DSL (i.e., how DSSVD would
>>>>> be written) and the logical layer (i.e., the drm.plan.* optimizer
>>>>> etc.) more "pure" and move the Spark-specific code entirely into the
>>>>> physical domain. I recall Dmitry mentioning that a new engine other
>>>>> than Spark was also being planned, so I deduce some thought has
>>>>> already been given to such "purification".
>>>>
>>>> Aha. The hope is for Stratosphere. But there are a few items that
>>>> need to be done by the Stratosphere folks before we can leverage it
>>>> fully - or, let's say, leverage it much better than we otherwise
>>>> could. Makes sense to wait a bit.
>>>>
>>>>> It would be nice to see changes approximately like:
>>>>>
>>>>> Rename ./spark => ./dsl
>>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>>>>>   ./dsl/src/main/scala/org/apache/mahout/dsl
>>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>>>>>   ./dsl/main/scala/org/apache/mahout/dsl/spark-backend
>>>>
>>>> I was thinking along the lines of factoring the public traits and
>>>> logical operators (DrmLike etc.) out of the spark module into an
>>>> independent module without particular engine dependencies. Exactly.
>>>> It just hasn't come to that yet.
>>>>
>>>>> along with appropriately renaming packages and imports, and
>>>>> confining references to RDD and SparkContext completely within
>>>>> spark-backend.
>>>>>
>>>>> I think such a clean split would be necessary to introduce more
>>>>> backend engines. If no efforts are already underway, I would be
>>>>> glad to take on the DSL "purification" task.
>>>>
>>>> I think you got very close to my thinking about further steps here.
>>>> Like I said, I was just idling in wait for something like
>>>> Stratosphere to come closer to our orbit.
>>>
>>> OK, I think there is reasonable alignment on the goal. But you were
>>> not clear on whether you are going to be doing the purification split
>>> in the near future, or is that still an "unassigned task" which I can
>>> pick up?
>>>
>>> Avati
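P.S. To make the factoring idea a bit more concrete, here is a rough sketch of what such an engine-neutral module might contain. The package layout and trait members below are illustrative only (DrmLike is the existing trait, the rest is made up), not a commitment to any particular design:

    // Sketch of an engine-free logical layer; names are illustrative.
    package org.apache.mahout.dsl {

      // The logical distributed row matrix: no RDD, no SparkContext.
      trait DrmLike[K] {
        def nrow: Long
        def ncol: Int
      }

      // Contract each backend (Spark, H2O, Stratosphere, ...) fulfills.
      trait DistributedEngine {
        // Run the optimizer over a logical plan and materialize it.
        def checkpoint[K](plan: DrmLike[K]): DrmLike[K]

        // Engine-specific I/O entry point.
        def drmFromHdfs(path: String): DrmLike[Int]
      }
    }

Then things like CheckpointAction and DrmRddInput would live entirely inside the Spark backend module, behind the DistributedEngine boundary.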
