[jira] Commented: (MAHOUT-103) Co-occurence based nearest neighbourhood

Sean Owen (JIRA) Wed, 18 Mar 2009 14:06:16 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683170#action_12683170
 ]


Sean Owen commented on MAHOUT-103:
----------------------------------

1. How do you feel, then, about switching to more abstract objects rather than, 
say, "Click"? These could be the existing objects, modified ones, or new ones. 
As you say, I think the existing objects are about what is needed. That makes 
the solution that much more reusable. Same with the job -- the more it uses 
abstract/standard classes, the more reusable I think it looks.

2. Yeah the two interfaces are nearly identical: provide a method that takes 
two "items" as input and a numerical "score" as output. I suppose it just makes 
sense to use the existing ItemSimilarity interface in this section of the code.
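
To make the shape described in point 2 concrete, here is a minimal sketch of 
"two items in, one numerical score out", backed by simple co-occurrence counts. 
The names PairwiseScore, CooccurrenceScore, observe and score are hypothetical 
and chosen for illustration only; this is not the actual ItemSimilarity 
signature, and items are plain strings here as an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a pairwise scoring interface.
// This is NOT Mahout's ItemSimilarity; it only mirrors the idea of
// "two items as input, a numerical score as output".
interface PairwiseScore<T> {
    double score(T a, T b);
}

// Assumed implementation: the score is the number of times two items
// were observed together. Pairs are treated as unordered.
final class CooccurrenceScore implements PairwiseScore<String> {
    private final Map<String, Integer> counts = new HashMap<>();

    // Builds an order-independent key so (a, b) and (b, a) share one count.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + '\t' + b : b + '\t' + a;
    }

    // Records one co-occurrence of the two items.
    void observe(String a, String b) {
        counts.merge(key(a, b), 1, Integer::sum);
    }

    // Returns the co-occurrence count, or 0 for pairs never seen together.
    @Override
    public double score(String a, String b) {
        return counts.getOrDefault(key(a, b), 0);
    }
}
```

Anything that implements the single scoring method can then be swapped in 
wherever a pairwise score is consumed, which is the reusability argument above.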

3. Good question, here is my brief digression:

The code was originally written with an "on-line" model in mind -- 
recommendations happen in real-time. Over time that has proved inefficient or 
impractical for large data sets, though it remains quite nice for small- to 
medium-size data sets. Hence I have attempted to preserve the real-time model 
at the core, and build a batch-oriented extension around it using Hadoop.

The two are a bit separate, and that is fine. So in this section of the code, I 
don't mind attaching Hadoop-related jobs that are not intimately connected to 
the core code. I am trying to keep them as consistent as possible so that the 
original on-line and newer off-line models don't evolve into two separate 
worlds within this part of the code.

To be specific... well I don't know, I don't have a problem with adding this 
job actually. Ideally we build a bit more around it: it takes as input the 
standard preference-file format as used by FileDataModel, and outputs a file 
format that can be read by a new ItemSimilarity implementation that would 
read and cache all these results. That would be a nice step towards integrating 
with the core code.
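
As a rough sketch of the "read and cache" half of that idea: the class below 
loads precomputed pair scores once and then answers lookups from memory. 
Everything here is an assumption for illustration: the class name 
PrecomputedSimilarity, the one-pair-per-line "itemA,itemB,score" layout, and 
the use of strings as item IDs. The job's real output format is not defined in 
this comment, and this class does not implement the actual ItemSimilarity 
interface.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical reader for precomputed item-item scores.
// Assumed input layout (not the job's real format): itemA,itemB,score
final class PrecomputedSimilarity {
    private final Map<String, Double> cache = new HashMap<>();

    // Reads every line once and caches the scores in memory.
    PrecomputedSimilarity(Reader source) throws IOException {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // tolerate blank lines
                }
                String[] fields = line.split(",");
                if (fields.length != 3) {
                    throw new IOException("Malformed line: " + line);
                }
                double score = Double.parseDouble(fields[2].trim());
                cache.put(key(fields[0].trim(), fields[1].trim()), score);
            }
        }
    }

    // Order-independent key so lookups are symmetric.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + '\t' + b : b + '\t' + a;
    }

    // Returns the cached score, or NaN when the pair was not precomputed.
    double similarity(String a, String b) {
        return cache.getOrDefault(key(a, b), Double.NaN);
    }
}
```

Returning NaN for unknown pairs is a design choice of this sketch; a real 
implementation might prefer zero or an exception, depending on what the core 
code expects.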

This is something I have been remiss in - I wrote a job to do the 
pre-computation of item-item diffs for slope one but never wrote an 
implementation of DiffStorage that would read this output and operate based on 
those results. This would close the loop. 

How about we make #3 my part of this issue, to complete the connection between 
this job and the core code a bit more?

> Co-occurence based nearest neighbourhood
> ----------------------------------------
>
>                 Key: MAHOUT-103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-103
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: jira-103.patch
>
>
> Nearest neighborhood type queries for users/items can be answered efficiently 
> and effectively by analyzing the co-occurrence model of a user/item w.r.t 
> another. This patch aims at providing an implementation for answering such 
> queries based upon simple co-occurrence counts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
