[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027986#comment-14027986
 ] 

Pat Ferrel commented on MAHOUT-1464:
------------------------------------

The algo transposes A (the primary) before self-cooccurrence. That gives us a 
point where the columns of A are available as rows, which in turn makes 
distributed ops on the DRM simple. So rather than counting over columns, my 
earlier proposal was to take the counts from the same data while it is in row 
form. Might this be better, since it can easily be a distributed calculation?

In other words, since A.t * A is calculated, we can split it into a transpose 
and a multiply, taking the column counts from the rows of A.t and then doing 
the multiply. Each calculation in the list (A.t * A, B.t * A, ...) includes a 
stage where the columns turn into rows, so the same approach can be used.
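
A minimal sketch of that idea in plain Python (not the Mahout DSL; the 
sparse-row layout and the helper names are assumptions made up for 
illustration). It only shows that the per-column non-zero counts of A can be 
read off as per-row counts of A.t, where each row is handled independently:

```python
# Sparse matrix as a list of rows; each row maps column index -> value.
# The layout and function names are illustrative assumptions, not Mahout's API.

def transpose(rows, ncols):
    """Return the transpose as a list of ncols sparse rows."""
    out = [dict() for _ in range(ncols)]
    for r, row in enumerate(rows):
        for c, v in row.items():
            out[c][r] = v
    return out

def nonzeros_per_row(rows):
    """Count non-zero entries in each row; no row needs data from another."""
    return [sum(1 for v in row.values() if v != 0) for row in rows]

def nonzeros_per_column(rows, ncols):
    """Reference version: count non-zeros per column by scanning every row."""
    counts = [0] * ncols
    for row in rows:
        for c, v in row.items():
            if v != 0:
                counts[c] += 1
    return counts

A = [{0: 1.0, 2: 3.0}, {0: 2.0, 1: 5.0}, {2: 1.0}]
At = transpose(A, 3)

# Column counts of A, taken from the rows of A.t.
column_counts = nonzeros_per_row(At)
```

Here column_counts agrees with nonzeros_per_column(A, 3), so the counting can 
happen at the transposed stage before the multiply.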

This brings back what was a bug as a significant optimization: if the data is 
already boolean, use colSums and no distributed counting is needed.
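
A small sketch of why the boolean case is special, again in plain Python with 
made-up helper names (an assumption, not the code in the patch): when every 
stored value is 1, the column sums equal the column non-zero counts, while for 
weighted data they differ.

```python
# Sparse matrix as a list of rows; each row maps column index -> value.
# Helper names are illustrative assumptions, not Mahout's API.

def col_sums(rows, ncols):
    """Sum the values in each column."""
    sums = [0.0] * ncols
    for row in rows:
        for c, v in row.items():
            sums[c] += v
    return sums

def col_nonzero_counts(rows, ncols):
    """Count the non-zero values in each column."""
    counts = [0] * ncols
    for row in rows:
        for c, v in row.items():
            if v != 0:
                counts[c] += 1
    return counts

boolean_A = [{0: 1.0, 2: 1.0}, {0: 1.0, 1: 1.0}, {2: 1.0}]
weighted_A = [{0: 1.0, 2: 3.0}, {0: 2.0, 1: 5.0}, {2: 1.0}]

# Boolean data: sums and counts coincide, so the sums can stand in for counts.
boolean_match = col_sums(boolean_A, 3) == col_nonzero_counts(boolean_A, 3)

# Weighted data: sums overstate the counts, which is where the bug came from.
weighted_match = col_sums(weighted_A, 3) == col_nonzero_counts(weighted_A, 3)
```

With these inputs boolean_match is True and weighted_match is False.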

Not sure if the above is all true, so read it as a question.



> Cooccurrence Analysis on Spark
> ------------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
