[ https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861520#action_12861520 ]
Ted Dunning commented on MAHOUT-305:
------------------------------------
My own approach in the past was to group by user to get a count as well as a
list of items for that user. This can be done in one MR step with a bit of
fancy footwork, or in two if you want to keep it simple. The fancy footwork
involves sampling the item list as it is read into memory, so that we never
keep too many items. It is relatively easy to do the sampling in a completely
fair way, giving every item an equal chance of survival, by using a swapping
algorithm. A completely
sampling is also trivial. With some thought, it is probably possible to do
various recency-weighted samples as well.
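
As an illustration of the fair, swap-based sampling described above, here is a
minimal sketch in Python. It assumes the "swapping algorithm" is something like
classic reservoir sampling; that reading, the function name reservoir_sample,
and the parameter names are assumptions, not code from Mahout.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return a uniform random sample of at most k items from an iterable.

    Only k items are ever held in memory, so the full item list for a
    user never needs to be materialized.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, item in enumerate(items, start=1):
        if len(reservoir) < k:
            # Fill phase: the first k items are kept unconditionally.
            reservoir.append(item)
        else:
            # Swap phase: the n-th item replaces a random slot with
            # probability k / n, which keeps every item equally likely.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

In a reducer that receives one user's items as a stream, something along these
lines could cap the list at k items in a single pass.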
With the count of items for each user, or a clever on-line sampling
algorithm, I can down-sample the user list before running the actual
cooccurrence counting step. This is also a good point to drop users with
fewer than k_min items. k_min should be at least 2, since a user with only
one item cannot give rise to any non-trivial cooccurrence. A value of 3-5
isn't bad either.
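
The filtering and down-sampling step might look roughly like the sketch below.
It works on an in-memory dict purely for illustration; k_max (the cap on items
per user), the function name downsample_users, and the default values are
hypothetical and do not come from the comment or from Mahout.

```python
import random


def downsample_users(user_items, k_min=2, k_max=100, seed=None):
    """Drop users with fewer than k_min items and cap the rest at k_max.

    user_items maps a user id to a collection of item ids.
    k_max is a hypothetical cap introduced for this sketch.
    """
    rng = random.Random(seed)
    kept = {}
    for user, items in user_items.items():
        items = list(items)
        if len(items) < k_min:
            # Too few items to produce any non-trivial cooccurrence.
            continue
        if len(items) > k_max:
            # Uniform sample without replacement down to k_max items.
            items = rng.sample(items, k_max)
        kept[user] = items
    return kept
```

Users at or under the cap pass through unchanged, so only the heaviest users
lose any data.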
The total time is largely dominated by reading the original data, so the
extra MR step doesn't hurt all that much. The win from avoiding the
quadratic explosion of the cooccurrence step is massive.
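
To make the quadratic growth concrete: a user with n items contributes
n(n-1)/2 unordered item pairs. The small sketch below just evaluates that
formula; the function name and the example sizes (10,000 items, capped to 100)
are illustrative assumptions, not figures from any real data set.

```python
def cooccurrence_pairs(n_items):
    """Number of unordered item pairs generated by one user with n_items items."""
    return n_items * (n_items - 1) // 2


print(cooccurrence_pairs(1))       # 0, which is why k_min must be at least 2
print(cooccurrence_pairs(100))     # 4950
print(cooccurrence_pairs(10_000))  # 49995000
```

Capping such a user from 10,000 items to 100 cuts that user's pair count by a
factor of roughly 10,000.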
> Combine both cooccurrence-based CF M/R jobs
> -------------------------------------------
>
> Key: MAHOUT-305
> URL: https://issues.apache.org/jira/browse/MAHOUT-305
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Affects Versions: 0.2
> Reporter: Sean Owen
> Assignee: Ankur
> Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make
> recommendations based on item co-occurrence:
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be
> merged. Not sure exactly how to approach that but noting this in JIRA, per
> Ankur.