[ 
https://issues.apache.org/jira/browse/MAHOUT-305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860882#action_12860882
 ] 

Ankur commented on MAHOUT-305:
------------------------------

CooccurrenceCombiner caches items internally and increments counts whenever it 
sees a new value. This could lead to memory issues with very large datasets. 
Moreover, for every (item-id, count) pair cached, a new object is created just 
to apply a simple procedure. That looks like overkill to me.

With the secondary sort, the (item1, item2) pairs are already sorted, so that 
for each key (item1) all the (item1, item2) pairs appear before the 
(item1, item3) pairs, assuming item2 < item3. With this we simply increment the 
count each time we see item2 and put the (item2, count) entry into a priority 
queue as soon as we see item3 or something else. The size of the priority queue 
can be limited to N. Check out ItemSimilarityEstimator.java.
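
To make the idea concrete, here is a rough, standalone sketch of the counting 
step. It is not the code in ItemSimilarityEstimator.java; the class and method 
names (TopNCooccurrence, ItemCount, topN) are made up for illustration, and a 
plain long[] stands in for the sorted item2 values a reducer would receive for 
one item1 key. It assumes equal item ids arrive adjacent to each other, which 
is what the secondary sort provides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: names here are hypothetical, not Mahout classes.
final class TopNCooccurrence {

    /** Simple (item, count) holder. */
    static final class ItemCount {
        final long itemId;
        final int count;

        ItemCount(long itemId, int count) {
            this.itemId = itemId;
            this.count = count;
        }
    }

    private static final Comparator<ItemCount> BY_COUNT =
        Comparator.comparingInt(e -> e.count);

    /**
     * Counts runs of equal ids in an already-sorted array and keeps the
     * n entries with the highest counts, ordered by count descending.
     */
    static List<ItemCount> topN(long[] sortedItem2Ids, int n) {
        // Min-heap on count: the head is the weakest entry currently kept.
        PriorityQueue<ItemCount> queue = new PriorityQueue<>(BY_COUNT);
        int i = 0;
        while (i < sortedItem2Ids.length) {
            long current = sortedItem2Ids[i];
            int count = 0;
            // Equal ids are adjacent, so one running counter is enough.
            while (i < sortedItem2Ids.length && sortedItem2Ids[i] == current) {
                count++;
                i++;
            }
            // The id changed (or input ended): the count is final.
            offer(queue, new ItemCount(current, count), n);
        }
        List<ItemCount> result = new ArrayList<>(queue);
        result.sort(BY_COUNT.reversed());
        return result;
    }

    /** Adds the candidate while keeping at most n entries in the queue. */
    private static void offer(PriorityQueue<ItemCount> queue,
                              ItemCount candidate, int n) {
        if (n <= 0) {
            return;
        }
        if (queue.size() < n) {
            queue.add(candidate);
        } else if (queue.peek().count < candidate.count) {
            queue.poll();
            queue.add(candidate);
        }
    }
}
```

Memory use here is bounded by N entries per key rather than by the number of 
distinct co-occurring items, which is the point of relying on the sort order 
instead of an internal cache.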

Agreed, we need better facilities for pruning, something like support-count 
(any others?).

About merging, I feel CooccurrenceCombiner would be better with secondary sort. 
Also, it would be good if we can retain TupleWritable for future use. Other 
than these, I have no issues with throwing away the code under 
o.a.m.cf.taste.hadoop.cooccurrence.

> Combine both cooccurrence-based CF M/R jobs
> -------------------------------------------
>
>                 Key: MAHOUT-305
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-305
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.2
>            Reporter: Sean Owen
>            Assignee: Ankur
>            Priority: Minor
>
> We have two different but essentially identical MapReduce jobs to make 
> recommendations based on item co-occurrence: 
> org.apache.mahout.cf.taste.hadoop.{item,cooccurrence}. They ought to be 
> merged. Not sure exactly how to approach that but noting this in JIRA, per 
> Ankur.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
