[ https://issues.apache.org/jira/browse/MAHOUT-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683170#action_12683170 ]
Sean Owen commented on MAHOUT-103:
----------------------------------

1. How do you feel about, therefore, changing to use more abstract objects rather than, say, "Click"? These objects could be the existing ones, or modified or new ones. I think, as you say, the existing objects are about what is needed. That way the solution is that much more reusable. Same with the job: the more it uses abstract/standard classes, the more reusable I think it looks.

2. Yeah, the two interfaces are nearly identical: each provides a method that takes two "items" as input and returns a numerical "score" as output. I suppose it just makes sense to use the existing ItemSimilarity interface in this section of the code.

3. Good question, here is my brief digression:

The code was originally written with an "on-line" model in mind: recommendations happen in real time. Over time that has proved inefficient or impractical for large data sets, though it remains quite nice for small- to medium-size data sets. Hence I have attempted to preserve the real-time model at the core, and build a batch-oriented extension around it using Hadoop.

The two are a bit separate, and that is fine. So in this section of the code, I don't mind attaching Hadoop-related jobs that are not intimately connected to the core code. I am trying to keep them as consistent as possible so that the original on-line and newer off-line models don't evolve into two separate worlds within this part of the code.

To be specific... well I don't know, I don't have a problem with adding this job actually. Ideally we build a bit more around it: it takes as input the standard preference-file format as used by FileDataModel, and outputs a file format that can be read by a new ItemSimilarity implementation that would read and cache all these results. That would be a nice step towards integrating with the core code.
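
To make that last idea a little more concrete, here is a minimal sketch of what "read and cache all these results" could look like. Everything in it is an assumption for illustration only: the interface PairwiseItemScore is a hypothetical stand-in for the "two items in, one score out" contract and is not the actual ItemSimilarity interface, the class name CachedItemScores is made up, and the one-pair-per-line "itemA,itemB,score" format is just a guess at what a job might emit. It also assumes scores are symmetric, which holds for plain co-occurrence counts.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the "two items in, one score out" contract.
// This is NOT the actual Mahout ItemSimilarity interface.
interface PairwiseItemScore {
    double score(String itemA, String itemB);
}

// Loads precomputed item-item scores once, then answers lookups from memory.
// Assumed (not actual) line format: itemA,itemB,score
final class CachedItemScores implements PairwiseItemScore {
    private final Map<String, Double> scores = new HashMap<>();

    CachedItemScores(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // tolerate blank lines
                }
                String[] parts = line.split(",");
                if (parts.length != 3) {
                    throw new IllegalArgumentException("Malformed line: " + line);
                }
                double value = Double.parseDouble(parts[2].trim());
                scores.put(key(parts[0].trim(), parts[1].trim()), value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Order-independent key, so (a, b) and (b, a) share one cache entry.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "\t" + b : b + "\t" + a;
    }

    // Returns the cached score, or NaN when the pair was never precomputed.
    @Override
    public double score(String itemA, String itemB) {
        Double cached = scores.get(key(itemA, itemB));
        return cached == null ? Double.NaN : cached;
    }
}
```

In use, one would hand the constructor a FileReader over the job's output and then call score("itemA", "itemB") wherever a similarity is needed. Returning NaN for unseen pairs is one possible convention; a real implementation would follow whatever the core code expects.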

This is something I have been remiss in - I wrote a job to do the pre-computation of item-item diffs for slope one but never wrote an implementation of DiffStorage that would read this output and operate based on those results. This would close the loop.

How about we make #3 my part of this issue, to complete the connection between this job and the core code a bit more?

> Co-occurence based nearest neighbourhood
> ----------------------------------------
>
>                 Key: MAHOUT-103
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-103
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Collaborative Filtering
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: jira-103.patch
>
>
> Nearest neighborhood type queries for users/items can be answered efficiently
> and effectively by analyzing the co-occurrence model of a user/item w.r.t
> another. This patch aims at providing an implementation for answering such
> queries based upon simple co-occurrence counts.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.