[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027159#comment-14027159 ]

Pat Ferrel commented on MAHOUT-1464:
------------------------------------

I think the same thing is happening with the number of item interactions:

    // Broadcast vector containing the number of interactions with each thing
    val bcastNumInteractions = drmBroadcast(drmI.colSums)// sums?

This broadcasts a vector of sums. We need a getNumNonZeroElements() for column 
vectors, or rather a way to get a Vector of non-zero counts per column. We could 
get them from the rows of the transposed matrix before doing the multiply of 
A.t %*% A or B.t %*% A, in which case we’d get non-zero counts from the rows. 
Either way I don’t see a way to get a vector of these values without doing a 
mapBlock on the transposed matrix. Am I missing something?
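
For what it’s worth, here is a minimal sketch of one way the counts could be 
built without an explicit transpose: map every non-zero entry of each block to 
1.0 inside a mapBlock, then take colSums of the resulting 0/1 matrix. This 
assumes drmI is a DrmLike[Int] and the usual scalabindings/RLikeOps imports; it 
is an illustration, not a claim about an existing DSL method.

    // Sketch: per-column non-zero counts via mapBlock, no transpose needed
    import scala.collection.JavaConversions._

    val drmIndicator = drmI.mapBlock() { case (keys, block) =>
      // overwrite every non-zero entry with 1.0 so column sums become counts
      for (r <- 0 until block.nrow; e <- block(r, ::).nonZeroes()) e.set(1.0)
      keys -> block
    }
    // Broadcast vector containing the number of non-zero interactions per column
    val bcastNumInteractions = drmBroadcast(drmIndicator.colSums)

Of course, if drmI already holds only 1.0 entries then colSums and the non-zero 
counts coincide, which may be why the current code is correct for strictly 
boolean input.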

Currently the IndexedDataset is a very thin wrapper, but I could add two 
vectors containing the number of non-zero elements for rows and columns. In 
that case I would perhaps have it extend CheckpointedDrm. Since CheckpointedDrm 
extends DrmLike it could be used in the DSL algebra directly, and it would then 
be simple to do the right thing with these vectors, as well as with the two id 
dictionaries, on transpose and multiply, but it’s a slippery slope.
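
To make the wrapper idea concrete, a hypothetical shape for an extended 
IndexedDataset might look like the following. The field names and the BiMap 
dictionaries are assumptions for illustration, not the current class:

    // Hypothetical sketch only; names are illustrative, not the existing API
    import com.google.common.collect.BiMap
    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.drm.CheckpointedDrm

    case class IndexedDataset(
        matrix: CheckpointedDrm[Int],   // wrapped DRM
        rowIDs: BiMap[String, Int],     // external -> internal row ids
        columnIDs: BiMap[String, Int],  // external -> internal column ids
        rowNonZeroCounts: Vector,       // number of non-zero elements per row
        columnNonZeroCounts: Vector)    // number of non-zero elements per column

Having it extend CheckpointedDrm rather than wrap one would mean delegating its 
methods, which is part of the slippery slope mentioned above.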

Before I go off in the wrong direction, is there an existing way to get a 
vector of non-zero counts for rows or columns?


> Cooccurrence Analysis on Spark
> ------------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
