[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861520#action_12861520
 ] 

Ted Dunning commented on MAHOUT-305:
------------------------------------

My own approach in the past was to group on user to get a count as well as a 
list of items for that user.  This can be done in one MR step with a bit of 
fancy footwork or in two if you want simple.  The fancy footwork involves 
reading the item list into memory as we sample to avoid keeping too many.  It 
is relatively easy to do the sampling in a completely fair way, allowing all 
samples equal chance of survival by using a swapping algorithm.  A completely 
sampling is also trivial.  With some thought, it is probably possible to do 
various recency weighted samples as well.  
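
As a rough illustration of the fair, swap-based sampling described above, here is a minimal Python sketch of classic reservoir sampling. It assumes that "swapping algorithm" refers to something along these lines; the function name `reservoir_sample` and the parameter `max_items` are hypothetical and not taken from Mahout. It returns both the bounded sample and the full item count, so one pass over a user's items yields both.

```python
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample(
    items: Iterable[T],
    max_items: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[T], int]:
    """Keep at most max_items of the stream, each with equal probability.

    Returns (sample, total_count). Memory use is bounded by max_items.
    """
    if max_items < 0:
        raise ValueError("max_items must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    count = 0
    for item in items:
        count += 1
        if len(reservoir) < max_items:
            # Still filling the reservoir: keep everything.
            reservoir.append(item)
        else:
            # Pick a slot uniformly from the count items seen so far;
            # the new item survives only if the slot is in the reservoir.
            j = rng.randrange(count)
            if j < max_items:
                reservoir[j] = item
    return reservoir, count
```

In a reducer keyed on user, something like this could be applied to the incoming item stream so that no more than `max_items` items are ever held in memory for one user.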

With the count of items for each user, or a clever on-line sampling 
algorithm, I can down-sample the user list before running the actual 
cooccurrence counting step.  This is a good point to drop users with < k_min 
items.  k_min should be at least 2 since users with one item cannot give rise 
to non-trivial cooccurrence.  A value of 3-5 isn't bad either.
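
The pruning step described above might look roughly like the following sketch. It assumes the per-user item lists are already grouped in a dictionary; `prune_user_items` and the cap `k_max` are hypothetical names introduced for illustration, and only `k_min` comes from the comment itself.

```python
import random
from typing import Dict, Hashable, List, Optional


def prune_user_items(
    user_items: Dict[Hashable, List[Hashable]],
    k_min: int = 2,
    k_max: int = 100,
    rng: Optional[random.Random] = None,
) -> Dict[Hashable, List[Hashable]]:
    """Drop users with fewer than k_min items; cap the rest at k_max items."""
    rng = rng or random.Random()
    pruned: Dict[Hashable, List[Hashable]] = {}
    for user, items in user_items.items():
        # De-duplicate while preserving the original order.
        distinct = list(dict.fromkeys(items))
        if len(distinct) < k_min:
            # Too few items to produce any non-trivial cooccurrence.
            continue
        if len(distinct) > k_max:
            # Uniform down-sample without replacement.
            distinct = rng.sample(distinct, k_max)
        pruned[user] = distinct
    return pruned
```

The default `k_max=100` is an arbitrary placeholder; a real job would presumably tune it against the data.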

The total time involved is largely dominated by the original data reading, so 
the extra MR step doesn't hurt all that much.  The win obtained by avoiding the 
quadratic explosion of the cooccurrence step is massive.
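
To make the quadratic growth concrete, here is a small sketch that counts the unordered item pairs a single user contributes to the cooccurrence step. It assumes pairs are counted once per user and without self-pairs; `cooccurrence_pairs` is a hypothetical helper, and the item counts in the usage note are illustrative, not measurements.

```python
def cooccurrence_pairs(n_items: int) -> int:
    """Number of unordered item pairs from one user with n_items items."""
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    # n choose 2: grows quadratically with the user's item count.
    return n_items * (n_items - 1) // 2
```

Under these assumptions, a user with 10,000 items yields `cooccurrence_pairs(10000)` = 49,995,000 pairs, while the same user capped at 100 items yields `cooccurrence_pairs(100)` = 4,950, and a user with one item yields none.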



> Combine both cooccurrence-based CF M/R jobs
> -------------------------------------------
>
>                 Key: MAHOUT-305
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-305
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.2
>            Reporter: Sean Owen
>            Assignee: Ankur
>            Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make 
> recommendations based on item co-occurrence: 
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
> merged. Not sure exactly how to approach that but noting this in JIRA, per 
> Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
