I'm reading this discussion with great interest. As you stress the importance of keeping the item-similarity-matrix sparse, I think it would be a useful improvement to add an option like "maxSimilaritiesPerItem" to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob, which would make it try to cut down the number of similar items per item.
However, since we store each similarity pair only once, a single item could still end up with more than "maxSimilaritiesPerItem" similar items: we can't drop a pair just because one of its items already has enough, as the other item in the pair might otherwise be left with too few similarities. I could add this feature if you agree that it's useful this way.

If one wishes to drop similarities below a certain cutoff, this could be done in a custom implementation of o.a.m.cf.taste.hadoop.similarity.DistributedItemSimilarity by simply returning NaN if the computed similarity is below that cutoff value.

-sebastian

2010/6/1 Ted Dunning <[email protected]>

> I normally deal with this by purposefully limiting the length of these
> rows. The argument is that if I never recommend more than 100 items to a
> person (or 20 or 1000 ... the argument doesn't change), then none of the
> item -> item* mappings needs to have more than 100 items since the tail
> of the list can't affect the top 100 recommendations anyway. It is also
> useful to limit the user history to either only recent or only important
> ratings. That means that a typical big multi-get is something like 100
> history items x 100 related items = 10,000 items x 10 bytes for id+score.
> This sounds kind of big, but the average case is 5x smaller.
>
> On Mon, May 31, 2010 at 4:01 PM, Sean Owen <[email protected]> wrote:
>
> > I'd be a little concerned about whether this fits comfortably in
> > memory. The similarity matrix is potentially dense -- big rows -- and
> > you're loading one row per item the user has rated. It could get into
> > tens of megabytes for one query. The distributed version dares not do
> > this. But, worth a try in principle.
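
To make the two ideas from the reply at the top of this message concrete, here is a minimal in-memory sketch. It is not the ItemSimilarityJob code and does not use any Mahout API: the class SimilarityPruning, the Pair type, and the methods prune and applyCutoff are hypothetical names chosen for illustration. It assumes a pair is kept whenever it ranks among the top "maxSimilaritiesPerItem" similarities of at least one of its two items, which is one possible reading of the rule described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Mahout.
final class SimilarityPruning {

    /** One undirected similarity pair, stored only once. */
    static final class Pair {
        final long itemA;
        final long itemB;
        final double similarity;

        Pair(long itemA, long itemB, double similarity) {
            this.itemA = itemA;
            this.itemB = itemB;
            this.similarity = similarity;
        }
    }

    /** Cutoff idea: report NaN (meaning "no similarity") below the cutoff. */
    static double applyCutoff(double similarity, double cutoff) {
        return similarity < cutoff ? Double.NaN : similarity;
    }

    /**
     * Keeps a pair if it is among the top maxSimilaritiesPerItem values of
     * at least one of its two items. Requires maxSimilaritiesPerItem >= 1.
     */
    static List<Pair> prune(List<Pair> pairs, int maxSimilaritiesPerItem) {
        // Gather every similarity value seen for each item, from both sides.
        Map<Long, List<Double>> perItem = new HashMap<Long, List<Double>>();
        for (Pair p : pairs) {
            valuesFor(perItem, p.itemA).add(p.similarity);
            valuesFor(perItem, p.itemB).add(p.similarity);
        }
        // Per item, the smallest value that still belongs to its top N.
        Map<Long, Double> threshold = new HashMap<Long, Double>();
        for (Map.Entry<Long, List<Double>> e : perItem.entrySet()) {
            List<Double> values = e.getValue();
            Collections.sort(values, Collections.reverseOrder());
            int last = Math.min(maxSimilaritiesPerItem, values.size()) - 1;
            threshold.put(e.getKey(), values.get(last));
        }
        // A pair survives if either of its items still needs it.
        List<Pair> kept = new ArrayList<Pair>();
        for (Pair p : pairs) {
            if (p.similarity >= threshold.get(p.itemA)
                    || p.similarity >= threshold.get(p.itemB)) {
                kept.add(p);
            }
        }
        return kept;
    }

    private static List<Double> valuesFor(Map<Long, List<Double>> map, long item) {
        List<Double> values = map.get(item);
        if (values == null) {
            values = new ArrayList<Double>();
            map.put(item, values);
        }
        return values;
    }

    public static void main(String[] args) {
        List<Pair> pairs = Arrays.asList(
                new Pair(1, 2, 0.9), new Pair(1, 3, 0.8),
                new Pair(1, 4, 0.7), new Pair(2, 3, 0.1));
        // With a limit of 1, item 1 still keeps three pairs, because
        // items 3 and 4 each need their best pair, which involves item 1.
        System.out.println(prune(pairs, 1).size()); // prints 3
    }
}
```

In the example in main, only the pair (2, 3) is dropped, so item 1 ends up above the limit exactly as described. A distributed version would have to compute the per-item thresholds in a separate pass, and ties at the threshold keep additional pairs in this sketch.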
