[ 
https://issues.apache.org/jira/browse/MAHOUT-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895569#action_12895569
 ] 

Ted Dunning commented on MAHOUT-455:
------------------------------------

At the risk of being rude, I think that this would not fix anything, since the 
current behavior is correct and follows directly from first principles.

The algorithm is called "nearest-neighbors" because that is what it does: the 
intent is to use only the nearest few neighbors.  This method has a long 
lineage in data mining, and the crux of it all is that "nearest few" part.  
That is what allows it to express very complex relations through locally 
linear ones.


If you really want to include all users, just compute the single average of all 
users off-line and be done with it.  This works because if you include any 
weighting by proximity, then you can use a moderate number of neighbors and 
get the same result (i.e. the current behavior).  If your weighting does not 
strongly depend on distance, as must be the case if all of the users are 
included in the sample, then you have a measure that does not depend on the 
user you are recommending for; that is, you have reinvented a "most-popular" 
recommendation.  If that is what you want, then you should use that and not a 
nearest-neighbor recommender.
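
To illustrate the point (a minimal sketch, not the Mahout API): if the 
neighborhood is "all users" and the weight on each neighbor does not depend 
on the target user's distance to them, the prediction collapses to the global 
item average, which is the same for every user -- a "most-popular" recommender.

```java
public class AllNeighborsSketch {
    // ratings[user][item]; every neighbor gets the same (distance-independent)
    // weight, so the target user drops out of the computation entirely.
    static double predict(double[][] ratings, int targetUser, int item) {
        double sum = 0.0;
        for (int u = 0; u < ratings.length; u++) {
            sum += ratings[u][item]; // uniform weight for all users
        }
        return sum / ratings.length; // targetUser never influences the result
    }

    public static void main(String[] args) {
        double[][] ratings = {
            {5, 1},
            {4, 2},
            {3, 3},
        };
        // Identical prediction for item 0 regardless of which user we ask for:
        System.out.println(predict(ratings, 0, 0)); // prints 4.0
        System.out.println(predict(ratings, 2, 0)); // prints 4.0
    }
}
```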

Either way, I think that the current code isn't broken.



> NearestNUserNeighborhood problems with large Ns
> -----------------------------------------------
>
>                 Key: MAHOUT-455
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-455
>             Project: Mahout
>          Issue Type: Bug
>          Components: Collaborative Filtering
>    Affects Versions: 0.3
>         Environment: Linux
>            Reporter: Yanir Seroussi
>            Priority: Minor
>
> I set a large n for NearestNUserNeighborhood, with the intention of including 
> all users in the neighbourhood. However, I encountered the following problems:
> (1) If n is set to Integer.MAX_VALUE, the program crashes with the following 
> stack trace:
> Exception in thread "main" java.lang.IllegalArgumentException
>       at java.util.PriorityQueue.<init>(PriorityQueue.java:152)
>       at 
> org.apache.mahout.cf.taste.impl.recommender.TopItems.getTopUsers(TopItems.java:93)
>       at 
> org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood.getUserNeighborhood(NearestNUserNeighborhood.java:111)
> This is because TopItems.getTopUsers() tries to create a PriorityQueue with a 
> capacity of Integer.MAX_VALUE + 1, which overflows to a negative value that 
> PriorityQueue's constructor rejects.
> (2) If n is set to a large integer value (e.g., 1 billion), it crashes with 
> the following stack trace:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>       at java.util.PriorityQueue.<init>(PriorityQueue.java:153)
>       at 
> org.apache.mahout.cf.taste.impl.recommender.TopItems.getTopUsers(TopItems.java:93)
>       at 
> org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood.getUserNeighborhood(NearestNUserNeighborhood.java:111)
> This has the same cause: the code tries to create a PriorityQueue of capacity 
> n + 1, and the backing array of over a billion references does not fit in 
> the heap.
> In my opinion, this should be fixed by capping n at the number of users in 
> the DataModel when NearestNUserNeighborhood is created, or by letting users 
> specify n = -1 (or a similar sentinel value) when they want the user 
> neighbourhood to include all users.
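
Both crashes in the report come straight from java.util.PriorityQueue: its 
constructor allocates a backing array of the requested capacity and throws 
IllegalArgumentException for any capacity below one. A minimal sketch, outside 
Mahout, of the first failure:

```java
import java.util.PriorityQueue;

public class CapacityOverflowDemo {
    public static void main(String[] args) {
        // Integer.MAX_VALUE + 1 wraps around to Integer.MIN_VALUE, a negative
        // number, which PriorityQueue's constructor rejects outright.
        int capacity = Integer.MAX_VALUE + 1;
        System.out.println(capacity); // prints -2147483648
        try {
            new PriorityQueue<Long>(capacity);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: IllegalArgumentException");
        }
        // With a merely huge n (e.g. one billion), the constructor instead
        // tries to allocate a billion-element backing array up front, which
        // throws OutOfMemoryError on any ordinary heap.
    }
}
```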

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
