[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028342#comment-14028342 ]

Ted Dunning commented on MAHOUT-1464:
-------------------------------------

I don't understand the question.

In fact, the transpose is never computed explicitly.  There is a special 
operation that computes A'A in a single pass.  It is possible to fuse the 
down-sampling into this multiplication, but not possible to fuse the column 
counts.  For large sparse A, the value of A'A is computed using a map-reduce 
style data flow in which each row is examined and all cooccurrence counts are 
emitted, to be grouped by row number later.
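
A minimal sketch of that data flow, in plain Scala rather than the Mahout or
Spark API, assuming each row of A has already been reduced to the set of its
non-zero column indices (the names here are illustrative only):

// One pass over the rows of A: each row emits all of its cooccurring
// column pairs, which are then grouped and counted to form A'A.
object AtASketch {
  def cooccurrences(rows: Seq[Set[Int]]): Map[(Int, Int), Long] =
    rows
      .flatMap { nz => for (i <- nz; j <- nz) yield (i, j) }
      .groupBy(identity)
      .map { case (pair, hits) => pair -> hits.size.toLong }

  def main(args: Array[String]): Unit = {
    val a = Seq(Set(0, 2), Set(1, 2), Set(0, 1, 2))
    cooccurrences(a).toSeq.sortBy(_._1).foreach(println)
  }
}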

In order to save memory, it is probably a good idea to discard the original 
counts as soon as they are reduced to binary form and down-sampled.
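
For instance, a row can be binarized and down-sampled in one step, after which
its original counts are no longer needed.  The maxPerRow cap and the random
sampling below are assumptions for illustration, not the actual Mahout code:

import scala.util.Random

// Keep only the non-zero column indices of a row (binary form) and cap how
// many are retained (down-sampling); the raw counts can be discarded after.
def binarizeAndSample(counts: Map[Int, Double],
                      maxPerRow: Int,
                      rng: Random = new Random(0)): Set[Int] = {
  val nonZero = counts.collect { case (col, c) if c > 0 => col }.toSet
  if (nonZero.size <= maxPerRow) nonZero
  else rng.shuffle(nonZero.toSeq).take(maxPerRow).toSet
}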

For computing counts, it is possible to accumulate column sums in a row-wise 
accumulator at the same time that row sums are accumulated one element at a 
time.  This avoids an extra pass over A and probably helps significantly with 
speed.
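
A sketch of that fused pass, again in plain Scala with a dense array
accumulator assumed purely for illustration:

// Single element-wise pass over sparse rows of A that fills both the per-row
// sums and a shared column-sum accumulator, avoiding a second pass over A.
def rowAndColumnSums(rows: Seq[Map[Int, Double]],
                     numCols: Int): (Array[Double], Array[Double]) = {
  val rowSums = new Array[Double](rows.size)
  val colSums = new Array[Double](numCols)
  for ((row, r) <- rows.zipWithIndex; (c, v) <- row) {
    rowSums(r) += v   // row sum, one element at a time
    colSums(c) += v   // column sum accumulated in the same pass
  }
  (rowSums, colSums)
}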


> Cooccurrence Analysis on Spark
> ------------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
