Sebastian, I'm still not sure how big or small a problem the implicit val is, but I will keep the point in the back of my mind as I explore further. I agree that the flexibility of backends is a powerful feature that makes Mahout unique and attractive indeed. I hope to see that flexibility exercised purely at runtime.
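
To make sure I understand the shape of the problem, here is roughly the signature change I picture. This is only a sketch; DistributedContext and drmFromHdfs are hypothetical names, not current Mahout API.

import org.apache.spark.SparkContext

trait DrmLike[K]

// Today's shape: the Spark-specific context leaks into the user-facing
// signature, so callers need a SparkContext in implicit scope even for
// otherwise engine-neutral code.
object SparkCoupled {
  def drmFromHdfs(path: String)(implicit sc: SparkContext): DrmLike[Int] = ???
}

// A possible engine-neutral shape: an abstract context that each backend
// (Spark, H2O, ...) would implement, resolved purely at runtime.
trait DistributedContext

object EngineNeutral {
  def drmFromHdfs(path: String)(implicit ctx: DistributedContext): DrmLike[Int] = ???
}
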
On Sun, Apr 27, 2014 at 11:11 PM, Sebastian Schelter <[email protected]> wrote:

> Anand,
>
> I'd also love to see work on a cleaner separation between the DSL and
> Spark. Another thing that should be tackled in the current code is that
> the SparkContext has to be present as an implicit val in some methods.
>
> Making the DSL run on different systems will be a powerful feature that
> will make Mahout unique and attractive to a lot of users, as it doesn't
> enforce a lock-in to a particular system. I've talked to a company
> recently that had exactly this requirement; they decided against using
> Spark, but would still be highly interested in running new Mahout
> recommenders built using the DSL.
>
> --sebastian
>
> On 04/28/2014 05:39 AM, Anand Avati wrote:
>
>> On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <[email protected]>
>> wrote:
>>
>>> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote:
>>>
>>>> Hi Ted, Dmitry,
>>>> Background: I am exploring the feasibility of providing an H2O
>>>> distributed "backend" to the DSL.
>>>
>>> Very cool. That actually was one of my initial proposals on how to
>>> approach this. Got pushed back on it, though.
>>
>> We are exploring various means of integration. The JIRA mentioned
>> providing Matrix and Vector implementations as an initial exploration.
>> That task by itself had a lot of value in terms of reconciling some
>> ground-level issues (build/mvn compatibility, highlighting some
>> classloader-related challenges on the H2O side, etc.). Plugging in
>> behind a common DSL makes sense, though there may be value in other
>> points of integration too, to exploit H2O's strengths.
>>
>>>> At a high level, implementing physical operators for DrmLike over H2O
>>>> does not seem extremely challenging. All the operators in the DSL seem
>>>> to have at least an approximate equivalent in H2O's own (R-like) DSL,
>>>> and wiring one operator to another's implementation seems like a
>>>> tractable problem.
>>>
>>> It should be tractable, sure, even for MapReduce. The question is
>>> whether there's enough diversity to do certain optimizations in a
>>> certain way. E.g., if two matrices are identically partitioned, do a
>>> map-side zip instead of an actual parallel join, etc.
>>>
>>> But it should be tractable, indeed.
>>
>> Yes, H2O has ways to do such things: a single map/reduce task over two
>> similarly partitioned matrices "side by side" (i.e., sharing the same
>> VectorGroup, in H2O terminology).
>>
>>>> The reason I write is to better understand the split between the
>>>> Mahout DSL and Spark (both current and future). As of today, the DSL
>>>> seems to be pretty tightly coupled with Spark. E.g.:
>>>>
>>>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>>
>>> This is a known thing; I think I noted it somewhere in JIRA. That, and
>>> the rdd property of CheckpointedDRM. This needs to be abstracted away.
>>>
>>>> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
>>>> DrmRddInput (instead of, say, DrmLike)
>>>
>>> CheckpointAction is part of the physical layer. This is something that
>>> would have to be completely rewritten for a new engine. This is the
>>> "plugin" API, but it is never user-facing (logical-plan-facing).
>>
>> It somehow felt that the optimizer was logical-ish. Do you mean the
>> optimizations in CheckpointAction are specific to Spark and cannot be
>> inherited in general by other backends (not that I think that would be
>> wrong)?
>>
>>>> Firstly, I don't think I am presenting some new revelation you guys
>>>> don't already know. I'm sure you know that the logical vs. physical
>>>> "split" in the DSL is not absolute (yet).
>>>
>>> Aha. Exactly.
>>>
>>>> That being said, I would like to understand if there are plans, or
>>>> efforts already underway, to make the DSL (i.e., how DSSVD would be
>>>> written) and the logical layer (i.e., the drm.plan.* optimizer, etc.)
>>>> more "pure" and move the Spark-specific code entirely into the
>>>> physical domain. I recall Dmitry mentioning that a new engine other
>>>> than Spark was also being planned; therefore I deduce that some
>>>> thought has already been applied to such "purification".
>>>
>>> Aha. The hope is for Stratosphere. But there are a few items that need
>>> to be done by the Stratosphere folks before we can leverage it fully.
>>> Or, let's say, leverage it much better than we otherwise could. Makes
>>> sense to wait a bit.
>>>
>>>> It would be nice to see changes approximately like:
>>>>
>>>> Rename ./spark => ./dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>>>> ./dsl/src/main/scala/org/apache/mahout/dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>>>> ./dsl/src/main/scala/org/apache/mahout/dsl/spark-backend
>>>
>>> I was thinking along the lines of factoring the public traits and
>>> logical operators (DrmLike, etc.) out of the spark module into an
>>> independent module without particular engine dependencies. Exactly.
>>> It just hasn't come to that yet.
>>>
>>>> along with appropriately renaming packages and imports, and confining
>>>> references to RDD and SparkContext completely within spark-backend.
>>>>
>>>> I think such a clean split would be necessary to introduce more
>>>> backend engines. If no efforts are already underway, I would be glad
>>>> to take on the DSL "purification" task.
>>>
>>> I think you got very close to my thinking about further steps here.
>>> Like I said, I was just idling in wait for something like Stratosphere
>>> to come closer to our orbit.
>>
>> OK, I think there is reasonable alignment on the goal. But you were not
>> clear on whether you are going to be doing the purification split in
>> the near future, or is that still an "unassigned task" which I can
>> pick up?
>>
>> Avati
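
To make the "purification" a bit more concrete, here is roughly the shape I have in mind: the dsl module would expose the logical types plus a small engine trait, and each backend module would implement that trait. A sketch only; all names below are hypothetical, not existing Mahout code.

// Engine-neutral logical type, living in the proposed dsl module.
trait DrmLike[K]

// Hypothetical backend contract. DSSVD and the drm.plan.* optimizer would
// compile against this trait alone, with no Spark or H2O imports; the
// spark-backend (and an eventual h2o-backend) would each supply an instance.
trait DistributedEngine {
  // Materialize an optimized logical plan on this engine, returning an
  // engine-neutral handle rather than something like DrmRddInput.
  def checkpoint[K](plan: DrmLike[K]): DrmLike[K]

  // Engine-specific I/O, confining SparkContext/RDD references to the backend.
  def drmFromHdfs(path: String): DrmLike[Int]
}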

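And on Dmitriy's point about identically partitioned matrices: in Spark terms I read that as something like the sketch below (illustrative only; it assumes matching row order within co-located partitions, and the helper name elementwiseSum is made up).

import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 0.9.x era)
import org.apache.spark.rdd.RDD

// Element-wise sum of two row-keyed matrices. If both RDDs share a
// partitioner, matching rows are already co-located, so we can walk the
// partitions side by side instead of paying for a shuffling join.
def elementwiseSum(a: RDD[(Int, Array[Double])],
                   b: RDD[(Int, Array[Double])]): RDD[(Int, Array[Double])] =
  if (a.partitioner.isDefined && a.partitioner == b.partitioner)
    // Map-side zip: no shuffle. Assumes identical row order within each
    // pair of co-located partitions.
    a.zipPartitions(b) { (ia, ib) =>
      ia.zip(ib).map { case ((k, ra), (_, rb)) =>
        (k, ra.zip(rb).map { case (x, y) => x + y })
      }
    }
  else
    // General case: an actual (shuffling) parallel join.
    a.join(b).mapValues { case (ra, rb) =>
      ra.zip(rb).map { case (x, y) => x + y }
    }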