On Mon, Apr 28, 2014 at 9:48 AM, Dmitriy Lyubimov <[email protected]> wrote:
> On Sun, Apr 27, 2014 at 8:39 PM, Anand Avati <[email protected]> wrote:
>> On Sun, Apr 27, 2014 at 8:07 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> On Sun, Apr 27, 2014 at 7:57 PM, Anand Avati <[email protected]> wrote:
>>>> Hi Ted, Dmitry,
>>>> Background: I am exploring the feasibility of providing an H2O distributed
>>>> "backend" to the DSL.
>>
>> Yes, H2O has ways to do such things - a single map/reduce task on two
>> matrices "side by side" which are similarly partitioned (i.e., sharing the
>> same VectorGroup in H2O terminology).
>
> OK. Another question I had was about internal data representation.
> First, the distributed architecture assumes the engines are agnostic of the
> type of payload as long as external serialization is provided. The way it
> has been explained so far, H2O is tightly bound to a particular data
> representation in the back end.
> Second, what we do here in Mahout is assume the back end can make data
> available in the form of vertical Matrix blocks to user closures running in
> the backend.

H2O's natural orientation is column-optimized, so the user closures running
in the backend would encounter horizontal Matrix blocks.

> Again, it was repeatedly explained that H2O has no matrix representation
> for backend things.

H2O has a strong 2-D "Frame". The Matrix abstraction over it is what was
built in github.com/tdunning/h2o-matrix, which mostly provides Matrix-ish
sounding names for functionality that already existed on an H2O Frame.

> So it looks like we can neither plug in mahout-math as the backend
> blockwise matrix representation, nor do we have access to an alternative
> Matrix-based vertical blocking. How is that to be resolved, in your opinion?

We can trivially provide (sub-)Matrix access with horizontal blocking in
H2O's mapreduce() - i.e., the mapper method in H2O's map/reduce API gets
access to a batch of rows, local to the compute node, one batch per mapper
call.
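To make the orientation question concrete, here is a minimal, self-contained sketch of the two ideas in play: slicing a matrix into horizontal (row-batch) blocks, the way a per-rowbatch mapper would see them, and the transpose that could transparently reconcile a row-oriented view with a column-oriented one. Plain Java 2-D arrays stand in for Mahout's Matrix and H2O's Frame here; `Blocking`, `horizontalBlocks`, and `batchRows` are hypothetical illustration names, not actual Mahout or H2O API.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: horizontal (row-batch) blocking, plus the transpose that would
 *  reconcile it with a column-oriented ("vertical") view. Plain 2-D double
 *  arrays stand in for Mahout's Matrix / H2O's Frame. */
public class Blocking {

    /** Split a matrix into horizontal blocks of at most batchRows rows each,
     *  roughly the way a per-rowbatch mapper call would see the data. */
    static List<double[][]> horizontalBlocks(double[][] m, int batchRows) {
        List<double[][]> blocks = new ArrayList<>();
        for (int start = 0; start < m.length; start += batchRows) {
            int rows = Math.min(batchRows, m.length - start);
            double[][] block = new double[rows][];
            for (int i = 0; i < rows; i++) block[i] = m[start + i];
            blocks.add(block);
        }
        return blocks;
    }

    /** Transparent transpose: a horizontal block of m is a vertical block of
     *  m-transposed, which is the reconciliation idea mentioned above. */
    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++)
                t[j][i] = m[i][j];
        return t;
    }
}
```

The point of the sketch is only that the two orientations are interconvertible at a well-defined layer, so the user-facing closure signature need not know which one the engine used.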
This is almost natural to H2O. The per-row mapper API in H2OMatrix is a
wrapper around the per-rowbatch internal API. And I think horizontal vs
vertical is an arbitrary choice and a reconcilable problem (transparently
transposing the matrix in the H2OMatrix layer).

>>>> The reason I write is to better understand the split between the Mahout
>>>> DSL and Spark (both current and future). As of today, the DSL seems to
>>>> be pretty tightly coupled with Spark.
>>>>
>>>> E.g.:
>>>>
>>>> - DSSVD.scala imports o.a.spark.storage.StorageLevel
>>>
>>> This is a known thing; I think I noted it somewhere in JIRA. That, and
>>> the rdd property of CheckpointedDRM. This needs to be abstracted away.
>>>
>>>> - drm.plan.CheckpointAction: the result of exec() and checkpoint() is
>>>> DrmRddInput (instead of, say, DrmLike)
>>>
>>> CheckpointAction is part of the physical layer. This is something that
>>> would have to be completely re-written for a new engine. It is the
>>> "plugin" API, but it is never user-facing (logical plan facing).
>>
>> It somehow felt that the optimizer was logical-ish. Do you mean the
>> optimizations in CheckpointAction are specific to Spark and cannot be
>> inherited in general by other backends (not that I think that is wrong)?
>
> Well, there are three things: the logical plan (the operator DAG), the
> physical DAG, and the optimizer rewrite & cost logic that transforms
> logical into physical.
>
> The logical DAG is user-facing and the top level. However, IMO the logic
> that rewrites the logical DAG into the physical DAG should be
> engine-specific, in order to capitalize on engine-specific capabilities.
> It would probably share a lot of commonalities (e.g., we could maintain a
> common pool of physical operators, assuming some commonalities between
> physical engine implementations), but the cost-rewriting part should still
> be specific, even if it ends up very similar to an existing one.
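The three-layer split described above (an engine-agnostic logical DAG, per-engine physical plans, and a per-engine rewriter between them) can be sketched roughly as below. All of the names (`LogicalOp`, `PhysicalPlanner`, `SparkPlanner`, `H2OPlanner`) are hypothetical illustrations, not the actual Mahout classes.

```java
/** Sketch of the three layers: a user-facing logical operator DAG, plus one
 *  planner per engine that rewrites the logical DAG into an engine-specific
 *  physical plan. All names are illustrative, not actual Mahout classes. */
public class PlanLayers {

    // Logical layer: user-facing, engine-agnostic operator DAG.
    interface LogicalOp { String describe(); }

    static class Source implements LogicalOp {
        final String name;
        Source(String name) { this.name = name; }
        public String describe() { return name; }
    }

    static class Transpose implements LogicalOp {
        final LogicalOp input;
        Transpose(LogicalOp input) { this.input = input; }
        public String describe() { return "t(" + input.describe() + ")"; }
    }

    // Physical layer: one planner per engine; each is free to apply its own
    // cost-based rewrites while sharing the logical operator vocabulary.
    interface PhysicalPlanner { String plan(LogicalOp root); }

    static class SparkPlanner implements PhysicalPlanner {
        public String plan(LogicalOp root) { return "spark-exec(" + root.describe() + ")"; }
    }

    static class H2OPlanner implements PhysicalPlanner {
        // A column-oriented engine might, for instance, elide a transpose by
        // flipping an orientation flag instead of moving data; here we only
        // tag the plan to keep the sketch small.
        public String plan(LogicalOp root) { return "h2o-exec(" + root.describe() + ")"; }
    }
}
```

The design point mirrors the email: the `LogicalOp` vocabulary is the shared, user-facing contract, while each `PhysicalPlanner` owns its engine's cost logic and rewrites.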
> I also want to reserve, exclusively for the Spark optimizer, future work
> that calls upon the advanced dynamic load scheduling techniques that were
> thoroughly investigated in the SystemML project.

>>>> Firstly, I don't think I am presenting some new revelation you guys
>>>> don't already know - I'm sure you know that the logical vs physical
>>>> "split" in the DSL is not absolute (yet).
>>>
>>> Aha. Exactly.
>>>
>>>> That being said, I would like to understand if there are plans, or
>>>> efforts already underway, to make the DSL (i.e., how DSSVD would be
>>>> written) and the logical layer (i.e., the drm.plan.* optimizer etc.)
>>>> more "pure", and move the Spark-specific code entirely into the
>>>> physical domain. I recall Dmitry mentioning that a new engine other
>>>> than Spark was also being planned, so I deduce some thought has
>>>> already gone into such "purification".
>>>
>>> Aha. The hope is for Stratosphere. But there are a few items that need
>>> to be done by the Stratosphere folks before we can leverage it fully -
>>> or, let's say, leverage it much better than we otherwise could. It makes
>>> sense to wait a bit.
>>>
>>>> It would be nice to see changes approximately like:
>>>>
>>>> Rename ./spark => ./dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings =>
>>>> ./dsl/src/main/scala/org/apache/mahout/dsl
>>>> Rename ./spark/src/main/scala/org/apache/mahout/sparkbindings/blas =>
>>>> ./dsl/main/scala/org/apache/mahout/dsl/spark-backend
>>>
>>> I was thinking along the lines of factoring the public traits and
>>> logical operators (DrmLike etc.) out of the spark module into an
>>> independent module without particular engine dependencies. Exactly. It
>>> just hasn't come to that yet.
>>>
>>>> along with appropriately renaming packages and imports, and confining
>>>> references to RDD and SparkContext completely within spark-backend.
>>>> I think such a clean split would be necessary to introduce more backend
>>>> engines. If no efforts are already underway, I would be glad to take on
>>>> the DSL "purification" task.
>>>
>>> I think you got very close to my thinking about further steps here. Like
>>> I said, I was just idling in wait for something like Stratosphere to
>>> become closer to our orbit.
>>
>> OK, I think there is reasonable alignment on the goal. But you were not
>> clear on whether you are going to be doing the purification split in the
>> near future, or is that still an "unassigned task" which I can pick up?
>
> Yes, it is unassigned, and frankly I thought I might want to continue
> working on this separation. However, you are welcome to take a stab,
> especially if you see a clear path for implementing the mapBlock()
> operator in H2O, per my questions above, without changing its signatures.

Subject to my understanding that mapBlock() slices a matrix into batches of
columns, I am tempted to believe the signature need not change. I am not
too concerned about the row vs column orientation just yet.

Avati
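The signature-stability question above can be made concrete with a small sketch: a blockwise map whose user closure receives (keys, block) pairs and never learns whether the engine delivered row batches or column batches. The names here (`MapBlockSketch`, `Block`, `mapBlock`) and the use of plain arrays in place of DRM blocks are illustrative assumptions, not Mahout's actual API.

```java
import java.util.function.Function;

/** Sketch of a signature-stable blockwise map: the user closure sees
 *  (keys, values) blocks regardless of how the engine sliced the matrix.
 *  Names and shapes are illustrative, not Mahout's actual mapBlock API. */
public class MapBlockSketch {

    /** One block of the distributed matrix: its keys plus a dense payload. */
    static class Block {
        final int[] keys;        // keys identifying the slices in this block
        final double[][] values; // the block's numeric payload
        Block(int[] keys, double[][] values) {
            this.keys = keys;
            this.values = values;
        }
    }

    /** Apply a user closure to every block. The closure's signature does not
     *  depend on whether blocks were produced by row or column batching, so
     *  an engine could change orientation without changing this contract. */
    static Block[] mapBlock(Block[] blocks, Function<Block, Block> fn) {
        Block[] out = new Block[blocks.length];
        for (int i = 0; i < blocks.length; i++) out[i] = fn.apply(blocks[i]);
        return out;
    }
}
```

A closure that, say, scales every entry of its block would be written once against this contract and run unchanged on either orientation, which is the crux of Dmitriy's "without changing its signatures" condition.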
