Hi Pat,
We truncate the indicators to the top-k and you don't want the
self-comparison in there. So I don't see a reason not to exclude it as
early as possible.
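A minimal, hypothetical sketch (plain Scala, made-up scores, no Mahout types) of why the self-pair is best dropped before the top-k truncation: the self-comparison always scores highest, so if it is left in it wastes one of the k indicator slots.

```scala
// Hypothetical similarity scores for item "A" against the catalog;
// "A" -> 1.0 is the self-pair, which always has the highest score.
val itemA = "A"
val scores = Map("A" -> 1.0, "B" -> 0.9, "C" -> 0.4, "D" -> 0.1)

def topK(k: Int, excludeSelf: Boolean): List[String] =
  scores.toList
    .filter { case (item, _) => !excludeSelf || item != itemA } // drop the self-pair early
    .sortBy { case (_, score) => -score }                       // highest score first
    .take(k)                                                    // truncate to top-k
    .map(_._1)

// With the self-pair excluded, both slots go to real neighbors.
println(topK(2, excludeSelf = true))  // List(B, C)
// Left in, the self-pair consumes one of the two slots.
println(topK(2, excludeSelf = false)) // List(A, B)
```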
--sebastian
On 06/10/2014 05:28 PM, Pat Ferrel wrote:
Still getting the wrong values with non-boolean input, so I’ll continue to
look into it.
Another question: computeIndicators seems to exclude self-comparison during
A’A but, of course, not for B’A. Since this returns the indicator matrix for
the general case, shouldn’t it include those values? It seems like they should
be filtered out in the output phase, if anywhere, and then only as an option.
If we were actually returning a multiply we’d include those.
// exclude co-occurrences of the item with itself
if (crossCooccurrence || thingB != thingA) {
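For illustration, a hypothetical, self-contained sketch of where that guard sits (the loop structure and names here are invented, not the actual Mahout code): for A’A the diagonal pairs are skipped, while for B’A (crossCooccurrence = true) every pair is kept.

```scala
// Hypothetical co-occurrence counting over one user's interactions.
def countCooccurrences(thingsA: Seq[Int], thingsB: Seq[Int],
                       crossCooccurrence: Boolean): Map[(Int, Int), Int] = {
  val counts = scala.collection.mutable.Map.empty[(Int, Int), Int].withDefaultValue(0)
  for (thingA <- thingsA; thingB <- thingsB) {
    // exclude co-occurrences of the item with itself (only relevant for A'A)
    if (crossCooccurrence || thingB != thingA) {
      counts((thingA, thingB)) += 1
    }
  }
  counts.toMap
}

// A'A over items {1, 2}: the (1,1) and (2,2) self-pairs are dropped.
println(countCooccurrences(Seq(1, 2), Seq(1, 2), crossCooccurrence = false))
// B'A keeps all four pairs, including the matching indices.
println(countCooccurrences(Seq(1, 2), Seq(1, 2), crossCooccurrence = true))
```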
On Jun 10, 2014, at 1:49 AM, Sebastian Schelter <s...@apache.org> wrote:
Oh good catch! I had an extra binarize method before, so that the data was
already binary. I merged that into the downsample code and must have overlooked
that thing. You are right, numNonZeros is the way to go!
On 06/10/2014 01:11 AM, Ted Dunning wrote:
Sounds like a very plausible root cause.
On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA) <j...@apache.org> wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]
Pat Ferrel commented on MAHOUT-1464:
------------------------------------
Seems like the downsampleAndBinarize method is returning the wrong values.
It is actually summing the values where it should be counting the non-zero
elements?
// Downsample the interaction vector of each user
for (userIndex <- 0 until keys.size) {
  val interactionsOfUser = block(userIndex, ::) // this is a Vector
  // if the values are non-boolean the sum will not be the number of
  // interactions, it will be a sum of strength-of-interaction, right?
  // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
  val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
  val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

  interactionsOfUser.nonZeroes().foreach { elem =>
    val numInteractionsWithThing = numInteractions(elem.index)
    val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

    if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
      // We ignore the original interaction value and create a binary 0-1 matrix
      // as we only consider whether interactions happened or did not happen
      downsampledBlock(userIndex, elem.index) = 1
    }
  }
}
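A minimal, Mahout-free sketch of the difference Pat describes (hypothetical data, plain Scala collections): for non-boolean interaction strengths, summing the vector is not the same as counting its non-zero elements, so the per-user sampling rate comes out wrong.

```scala
// One user's row of strength-of-interaction values (hypothetical data).
val interactionsOfUser = Vector(0.0, 3.0, 0.0, 5.0, 1.0)

val sumOfStrengths = interactionsOfUser.sum              // 9.0 -- sums strengths
val numInteractions = interactionsOfUser.count(_ != 0.0) // 3   -- counts interactions

// The downsampling rate should be based on the count of interactions,
// not the summed strengths:
val maxNumInteractions = 2
val perUserSampleRate =
  math.min(maxNumInteractions, numInteractions).toDouble / numInteractions // 2.0 / 3
println(perUserSampleRate)
```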
Cooccurrence Analysis on Spark
------------------------------
Key: MAHOUT-1464
URL: https://issues.apache.org/jira/browse/MAHOUT-1464
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Environment: hadoop, spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Fix For: 1.0
Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
run-spark-xrsj.sh
Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
a DRM can be used as input.
Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
has several applications including cross-action recommendations.
--
This message was sent by Atlassian JIRA
(v6.2#6252)