Do you need to compute the similarity between all pairs of users in order to measure similarity between any two users? No, not at all. There are several implementations of UserSimilarity, and in general they look only at the data associated with the two users being compared, not at all users.
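To make that concrete, here is a rough sketch in plain Java of a Pearson-style similarity between two users. It is not Mahout's code and does not use the Mahout API: the class name PairwiseSimilaritySketch, the method pearson, and the representation of a user's preferences as a map from item ID to rating are all assumptions for illustration. The point is only that the method takes the two users' preferences and nothing else.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of Mahout.
final class PairwiseSimilaritySketch {

    /**
     * Pearson correlation over the items that both users have rated.
     * Only the two users' own preferences are read.
     * Returns NaN when fewer than two items are shared or when either
     * user's shared ratings have no variance.
     */
    static double pearson(Map<Long, Double> prefsA, Map<Long, Double> prefsB) {
        int n = 0;
        double sumA = 0.0, sumB = 0.0, sumAA = 0.0, sumBB = 0.0, sumAB = 0.0;

        for (Map.Entry<Long, Double> entry : prefsA.entrySet()) {
            Double ratingB = prefsB.get(entry.getKey());
            if (ratingB == null) {
                continue; // item not rated by both users
            }
            double a = entry.getValue();
            double b = ratingB;
            n++;
            sumA += a;
            sumB += b;
            sumAA += a * a;
            sumBB += b * b;
            sumAB += a * b;
        }

        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        if (varianceA <= 0.0 || varianceB <= 0.0) {
            return Double.NaN;
        }
        return covariance / Math.sqrt(varianceA * varianceB);
    }
}
```

The cost of one call depends on how many items the two users have rated, not on how many users exist in total.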
Computing a neighborhood is different. There, in theory, you do need to compute the similarity between one user and all other users (but still not all pairs), and then pick some set of most-similar users. (And there are optimizations -- for example, you could sample 10% of all other users to form a "pretty good" neighborhood rather than actually look at everyone else.)

You bring up clustering. Indeed, that is one approach. You start by clustering users -- basically, making a bunch of disjoint neighborhoods ahead of time -- and then recommend from within the cluster. You can do that somewhat more efficiently than looking at all pairs. See TreeClusteringRecommender.

Yes, anything that requires looking at all pairs of users could be disastrously slow. If you have a lot of users but few items, consider using an item-based recommender instead. This would scale better.

On Tue, Jul 7, 2009 at 12:36 AM, charlysf <[email protected]> wrote:
>
> Hello,
>
> I currently working on a small database, I understand that, when I need the
> similarity between users, it's basically the compute between all pairs of
> users.
>
> It's that ? or it's better ?
> If it's that, how can I expect a quick compute for 1 million rows ?
>
> I don't see what is the difference between asking for the neighborhood, to
> compute the similarity for all pairs of users.
>
> Because I thought, something could be interesting :
> Make some clusters of users, and only compute the similarity between users
> in my cluster.
>
> Thanks
