[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893 ]
Pat Ferrel commented on MAHOUT-1464:
------------------------------------

It seems like the downsampleAndBinarize method is returning the wrong values: it is summing the interaction values where it should be counting the non-zero elements. If the values are non-boolean, the sum is not the number of interactions but a sum of strength-of-interaction. (A small standalone sketch of the arithmetic difference follows at the end of this message.)

    // Downsample the interaction vector of each user
    for (userIndex <- 0 until keys.size) {
      val interactionsOfUser = block(userIndex, ::) // this is a Vector

      // If the values are non-boolean the sum will not be the number of interactions,
      // it will be a sum of strength-of-interaction, right?
      // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
      val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think

      val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

      interactionsOfUser.nonZeroes().foreach { elem =>
        val numInteractionsWithThing = numInteractions(elem.index)
        val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

        if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
          // We ignore the original interaction value and create a binary 0-1 matrix
          // as we only consider whether interactions happened or did not happen
          downsampledBlock(userIndex, elem.index) = 1
        }
      }
    }

> Cooccurrence Analysis on Spark
> ------------------------------
>
>          Key: MAHOUT-1464
>          URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>      Project: Mahout
>   Issue Type: Improvement
>   Components: Collaborative Filtering
>  Environment: hadoop, spark
>     Reporter: Pat Ferrel
>     Assignee: Pat Ferrel
>      Fix For: 1.0
>
>  Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input.
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications, including cross-action recommendations.
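For reference, here is a minimal standalone sketch of the arithmetic in question. It is hypothetical: plain Scala collections instead of Mahout vectors, made-up names (DownsampleRateSketch, sumOfStrengths, etc.), and an artificially small maxNumInteractions so the effect is visible; it is not the actual downsampleAndBinarize code, only an illustration of how sum and the non-zero count diverge for non-boolean interaction values.

    // Hypothetical sketch (plain Scala, not the Mahout code) comparing the
    // per-user sample rate computed from summed strengths vs. from the
    // count of non-zero interactions.
    object DownsampleRateSketch extends App {

      // One user's non-zero interaction strengths, e.g. ratings or purchase counts.
      val interactionsOfUser = Seq(5.0, 2.0, 1.0)

      // Cap on interactions kept per user (kept tiny here so the difference shows;
      // the real job uses a much larger cap).
      val maxNumInteractions = 2

      // Summed-strength variant: treats 8.0 as if it were the interaction count.
      val sumOfStrengths = interactionsOfUser.sum
      val rateFromSum = math.min(maxNumInteractions, sumOfStrengths) / sumOfStrengths

      // Counting variant: uses the number of interactions that actually happened (3).
      val numNonZero = interactionsOfUser.count(_ != 0.0).toDouble
      val rateFromCount = math.min(maxNumInteractions, numNonZero) / numNonZero

      println(f"per-user sample rate from summed strengths: $rateFromSum%.3f") // 0.250
      println(f"per-user sample rate from non-zero count:   $rateFromCount%.3f") // 0.667
      // With boolean 0/1 input the two agree; with weighted input the summed
      // version distorts the sampling rate.
    }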