Of course, now that I've put this in JIRA, I'm wondering whether treating similarity as the main determiner of neighbourhood membership is the right call. In other words, what I wrote does this: include all users whose similarity to the target user is > minSimilarity. Then, if the hood is large, optionally trim it to maxHoodSize.
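To make that ordering concrete, here is a minimal sketch in plain Java. It is not the attached UserSimilarityNearestNUserNeighborhood.java: the class name, the select method, and the idea of passing precomputed similarities in a Map are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and signature are not from the patch.
class ThresholdFirstHood {

    // Step 1: keep every user whose similarity is > minSimilarity.
    // Step 2: if the hood is larger than maxHoodSize, keep only the most similar.
    static List<String> select(Map<String, Double> similarityToTarget,
                               double minSimilarity, int maxHoodSize) {
        List<Map.Entry<String, Double>> hood = new ArrayList<>();
        for (Map.Entry<String, Double> e : similarityToTarget.entrySet()) {
            if (e.getValue() > minSimilarity) {
                hood.add(e); // a tiny minSimilarity lets almost everyone in here
            }
        }
        // Most similar first, so trimming drops the weakest neighbours.
        hood.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : hood) {
            if (result.size() == maxHoodSize) {
                break;
            }
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> sims = new LinkedHashMap<>();
        sims.put("alice", 0.9);
        sims.put("bob", 0.4);
        sims.put("carol", 0.05);
        sims.put("dave", 0.7);
        System.out.println(select(sims, 0.3, 2));  // [alice, dave]
        System.out.println(select(sims, 0.95, 2)); // [] because nobody is similar enough
    }
}
```

Note that the intermediate list is only bounded by the threshold, not by maxHoodSize, so its size depends entirely on how minSimilarity was chosen.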
Scary things happen (read: slowness) if you use minSimilarity=0.001 or some other small number, because that creates a large hood. So now I'm wondering whether maxHoodSize should be the primary determiner, so that the code instead does this: include the top maxHoodSize users, then remove all users whose similarity to the target user is < minSimilarity.

I tested both approaches and they are equally fast if you pick a good minSimilarity. But if you pick an overly low minSimilarity... ouch: huge hood + slow. If you pick too high a minSimilarity, you risk finding no users that meet that criterion.

The drawback of the purely n-nearest approach is that the n nearest people may not really be very near. Consequently, recommendations derived from them will not be the best. My change tries to guard against that, but one might argue that getting some not-so-good recommendations is still better than getting no recommendations (e.g. because the given minSimilarity disqualifies all users and results in a 0-sized neighbourhood).

Thinking out loud...

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Otis Gospodnetic (JIRA) <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, November 7, 2008 1:04:54 PM
> Subject: [jira] Created: (MAHOUT-95) UserSimilarity-based NearestNNeighborhood
>
> UserSimilarity-based NearestNNeighborhood
> -----------------------------------------
>
>                 Key: MAHOUT-95
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-95
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>            Reporter: Otis Gospodnetic
>            Priority: Minor
>         Attachments: UserSimilarityNearestNUserNeighborhood.java
>
> A variation of NearestNUserNeighborhood. This version adds the minSimilarity
> parameter, which is the primary factor for including/excluding other users
> from the target user's neighbourhood. Additionally, the 'n' parameter was
> renamed to maxHoodSize and is used to optionally limit the size of the
> neighbourhood.
>
> The patch is for a brand new class, but we may really want just a single
> class (either keep this one and axe NearestNUserNeighborhood or add this
> functionality to NearestNUserNeighborhood), if this sounds good.
>
> I'll update the unit test and provide a patch for that if others think this
> can go in.
>
> Thoughts?
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
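
For comparison, the alternative ordering floated at the top of this message (take the top maxHoodSize users first, then drop anyone below minSimilarity) could be sketched roughly as follows. Again this is plain Java with made-up names (TopNFirstHood, select, a Map of precomputed similarities), not Mahout code; the bounded heap is one possible way to keep the working set small, not something taken from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration only; names and signature are not from the patch.
class TopNFirstHood {

    // Step 1: keep the maxHoodSize most similar users.
    // Step 2: remove those whose similarity is < minSimilarity.
    static List<String> select(Map<String, Double> similarityToTarget,
                               int maxHoodSize, double minSimilarity) {
        if (maxHoodSize <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on similarity: the head is the weakest of the current top N,
        // so the working set never grows beyond maxHoodSize + 1 entries.
        PriorityQueue<Map.Entry<String, Double>> topN =
                new PriorityQueue<>(Map.Entry.<String, Double>comparingByValue());
        for (Map.Entry<String, Double> e : similarityToTarget.entrySet()) {
            topN.add(e);
            if (topN.size() > maxHoodSize) {
                topN.poll(); // evict the least similar candidate
            }
        }
        List<String> result = new ArrayList<>();
        while (!topN.isEmpty()) {
            Map.Entry<String, Double> e = topN.poll();
            if (e.getValue() >= minSimilarity) {
                result.add(e.getKey());
            }
        }
        Collections.reverse(result); // heap drains weakest first; flip to most similar first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> sims = new LinkedHashMap<>();
        sims.put("alice", 0.9);
        sims.put("bob", 0.4);
        sims.put("carol", 0.05);
        sims.put("dave", 0.7);
        System.out.println(select(sims, 2, 0.3)); // [alice, dave]
        System.out.println(select(sims, 2, 0.8)); // [alice]
    }
}
```

Leaving aside ties and the > versus >= boundary, the two orderings should end up with the same users; what changes is how large the intermediate collection can get when minSimilarity is set very low.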
