[ https://issues.apache.org/jira/browse/MAHOUT-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906699#comment-13906699 ]

Dmitriy Lyubimov commented on MAHOUT-1346:
------------------------------------------

This is now tracked here:
https://github.com/dlyubimov/mahout-commits/tree/dev-1.0-spark
(new module "spark").

I have been rewriting certain things from scratch.

Concepts:
(a) Logical operators (including DRM sources) are expressed via the DRMLike trait.
(b) Taking a page from Spark's book, DRM operators (such as %*% or t) form an 
operator lineage. The operator lineage does not get optimized into an RDD until 
an "action" is applied (in Spark terminology).

(c) Unlike in Spark, an "action" doesn't actually cause any execution; rather, it 
(1) forms the optimized RDD sequence and (2) produces a "checkpointed" DRM. 
Consequently, a checkpointed DRM has an RDD lineage attached to it, which is also 
marked for caching. Additional lineages that start from a checkpointed DRM will 
not be able to optimize beyond that checkpoint.

(d) There is a "super action" on a checkpointed DRM (such as collecting it or 
persisting it to HDFS) that triggers, if necessary, the optimizer checkpoint and 
the Spark action.

E.g.:

{code}
val A = drmParallelize(...)

// Doesn't do anything yet; gives the operator lineage an opportunity
// to grow further before being optimized.
val squaredA = A.t %*% A

// We may trigger the optimizer, RDD lineage generation and caching
// explicitly by:
squaredA.checkpoint()

// Or we can call a "super action" directly. This triggers checkpoint()
// implicitly if it has not been done yet.
val inCoreSquaredA = squaredA.collect()
{code}
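To make the lineage/checkpoint/super-action distinction concrete, here is a minimal, self-contained Scala sketch of the scheme, using plain collections in place of RDDs. All names here (DRMLikeSketch, Source, Transpose, Checkpointed) are hypothetical illustrations, not the actual Mahout classes:

{code}
// Self-contained sketch (hypothetical names; plain Scala collections
// stand in for RDDs) of the lazy lineage + checkpoint scheme above.
sealed trait DRMLikeSketch {
  // "Action": collapses the logical lineage into a materialized,
  // cached result; later lineages cannot optimize past it.
  def checkpoint(): Checkpointed
  // "Super action": triggers checkpoint() implicitly if not yet done.
  def collect(): Vector[Vector[Double]] = checkpoint().data
}
// A DRM source, analogous to drmParallelize(...).
final case class Source(data: Vector[Vector[Double]]) extends DRMLikeSketch {
  def checkpoint() = Checkpointed(data)
}
// A logical (not yet executed) transposition, analogous to A.t.
final case class Transpose(a: DRMLikeSketch) extends DRMLikeSketch {
  def checkpoint() = Checkpointed(a.checkpoint().data.transpose)
}
final case class Checkpointed(data: Vector[Vector[Double]]) extends DRMLikeSketch {
  def checkpoint() = this // already materialized
}

val a = Source(Vector(Vector(1.0, 2.0), Vector(3.0, 4.0)))
val at = Transpose(a)     // builds lineage only; nothing executes yet
val inCore = at.collect() // implicit checkpoint, then collection
{code}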

Generally, I support very few things so far -- I actually dropped all previously 
implemented Bagel algorithms. So in fact I have less support now than in the 0.9 
branch.

I have Kryo support for Mahout vectors and matrix blocks. 
I have HDFS read/write of Mahout's DRM into the DRMLike trait.

I have some of the DSL defined, such as:
A %*% B 
A %*% inCoreB
inCoreA %*%: B

A.t
inCoreA = A.collect

A.blockify (coalesces split records into an RDD of vertical blocks -- a paradigm 
similar to MLI's MatrixSubmatrix, except I implemented it before MLI was first 
announced :) so no MLI influence here, in fact)
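For illustration, the coalescing that blockify performs could be sketched like this in plain Scala; the function name and record layout here are assumptions for the sketch, not the actual Mahout API:

{code}
// Hypothetical sketch of blockify: coalesce per-row DRM records
// (rowKey -> row vector) into vertical blocks, each holding the row
// keys plus the corresponding rows as one matrix block.
def blockifySketch(rows: Seq[(Int, Array[Double])], blockHeight: Int)
    : Seq[(Array[Int], Array[Array[Double]])] =
  rows.grouped(blockHeight).map { blk =>
    (blk.map(_._1).toArray, blk.map(_._2).toArray)
  }.toSeq

val rows = Seq(0 -> Array(1.0, 2.0), 1 -> Array(3.0, 4.0), 2 -> Array(5.0, 6.0))
val blocks = blockifySketch(rows, blockHeight = 2)
// yields two blocks: one for row keys {0, 1}, one for row key {2}
{code}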

So now I need to reimplement what Bagel used to do, plus optimizer rules for 
choosing a distributed algorithm based on cost.

In fact, I came to the conclusion that there was zero benefit in using Bagel in 
the first place: it just maps all its primitives onto shuffle-and-hash group-by 
RDD operations, so it provides no actual operational advantage.

I will probably reconstitute the algorithms in the first iteration using regular 
Spark primitives (groupBy, and cartesian for multiplication blocks).
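For instance, a blockwise A' B over vertical blocks reduces to pairing blocks by index (a groupBy/join in Spark terms), forming the partial products A_i' B_i, and summing them. A plain-Scala sketch under those assumptions (collections standing in for Spark primitives):

{code}
// Illustrative sketch only: A and B are stored as vertical blocks
// keyed by block index; Scala collections stand in for RDD operations.
type Block = Array[Array[Double]]

// Partial product A_i' * B_i for one pair of co-indexed blocks.
def atb(a: Block, b: Block): Block =
  Array.tabulate(a.head.length, b.head.length) { (i, j) =>
    (0 until a.length).map(k => a(k)(i) * b(k)(j)).sum
  }

// Element-wise sum of two partial products.
def add(x: Block, y: Block): Block =
  x.zip(y).map { case (rx, ry) => rx.zip(ry).map { case (u, v) => u + v } }

// A = [[1, 2], [3, 4]] and B = [[5], [6]], each split into 1-row blocks.
val aBlocks = Map(0 -> Array(Array(1.0, 2.0)), 1 -> Array(Array(3.0, 4.0)))
val bBlocks = Map(0 -> Array(Array(5.0)), 1 -> Array(Array(6.0)))

// "Join" blocks by index, multiply, reduce by summation.
val result = aBlocks.keys.map(k => atb(aBlocks(k), bBlocks(k))).reduce(add)
// A' B = [[1*5 + 3*6], [2*5 + 4*6]] = [[23.0], [34.0]]
{code}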

Once I plug in the missing pieces (e.g. slim matrix multiplication), I bet I will 
be able to fit a distributed SSVD version in 40 lines, just like the in-core one :)

Weighted ALS will still look less elegant because of some features lacking in the 
linear algebra. For example, it seems to need sparse block support (i.e. a bunch 
of sparse row or column vectors hanging off a small hash map, instead of the 
full-size array used in SparseRowMatrix/SparseColumnMatrix today), but it should 
still be mostly R-like scripting as far as working with matrix blocks and 
decompositions goes.

So at this point I'd like to hear input on these ideas and direction, and 
perhaps some suggestions. Thanks.


> Spark Bindings (DRM)
> --------------------
>
>                 Key: MAHOUT-1346
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1346
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.8
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 1.0
>
>
> Spark bindings for Mahout DRM. 
> DRM DSL. 
> Disclaimer. This will all be experimental at this point.
> The idea is to wrap DRM by Spark RDD with support of some basic 
> functionality, perhaps some humble beginning of Cost-based optimizer 
> (0) Spark serialization support for Vector, Matrix 
> (1) Bagel transposition 
> (2) slim X'X
> (2a) not-so-slim X'X
> (3) blockify() (compose RDD containing vertical blocks of original input)
> (4) read/write Mahout DRM off HDFS
> (5) A'B
> ...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
