Sorry, typo, that's what I meant. Yes, the difference isn't *that* large! It may be worse in practice, since you have a few users with very many prefs. It may also be beneficial to simply have one fewer phase and to throw around less data. I will also try to benchmark, since really that's the only way to know.
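
To make the two-phase idea quoted below concrete, here is a minimal sketch of what the reduce side would do once both item vectors of a pair are in hand. It is deliberately plain Java: dense double[] arrays stand in for Mahout's Vector writables, and the class and method names are placeholders rather than anything in the existing patch:

  // Sketch only: in the real job the reducer would receive the two item
  // vectors of an item-item pair; here they are dense arrays indexed by
  // user, with 0.0 meaning "no preference".
  public final class ItemPairCosineSketch {

    // Plain cosine between two item vectors. A zero-norm vector yields NaN,
    // which a real reducer would have to filter out.
    static double cosine(double[] x, double[] y) {
      double dot = 0.0;
      double normX = 0.0;
      double normY = 0.0;
      for (int i = 0; i < x.length; i++) {
        dot += x[i] * y[i];
        normX += x[i] * x[i];
        normY += y[i] * y[i];
      }
      return dot / (Math.sqrt(normX) * Math.sqrt(normY));
    }

    public static void main(String[] args) {
      double[] itemA = {5.0, 3.0, 0.0, 5.0};
      double[] itemB = {4.0, 3.0, 0.0, 5.0};
      System.out.println(cosine(itemA, itemB)); // ~0.994
    }
  }

Centering each vector first (subtracting its mean from every component) would turn the same computation into the Pearson correlation, which is what the JIRA discussion further down is about.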
On Wed, Apr 28, 2010 at 3:01 PM, Sebastian Schelter <[email protected]> wrote:
> Hi Sean,
>
> "Say I have P prefs, U users and I items.
> Assume P/I prefs per item. The bottleneck of this stage is outputting
> I^2 vectors of size P/I, or about P*I output.
>
> What you're doing now bottlenecks in the last bit where you output for
> each user some data for every pair of items. That's coarsely U *
> (P/U)^2 = U*P^2 output. I think that P^2 is killer."
>
> There's an error in the above math, as U * (P/U)^2 = P^2/U (not U*P^2),
> which makes me guess that there won't be any improvements by mapping out
> all item-item pairs in the beginning.
> I showed the calculations to a mathematician and he was also of the
> opinion that there wouldn't be any improvements. You'll find our
> thoughts below; please check that we did not overlook something. I will
> run the job with your patch in the next few days to get us a real-life
> comparison.
>
> Regards,
> Sebastian
>
> -------------------------------------
> Calculated with some example numbers:
>
> U = 1,000,000 users
> I = 50,000 items
> P = 50,000,000 preferences
>
> P*I   = 2.5 trillion
> P^2/U = 2.5 billion
>
> -------------------------------------
> Or stated as an inequality (each line is equivalent to the previous one):
>
> P^2/U > P*I
> <=> P/U > I     (divide both sides by P)
> <=> P > U*I     (multiply both sides by U)   <-- obviously not true
>
>
> Sean Owen wrote:
>> Well, it's not hard to add something like UncenteredCosineSimilarity
>> for sure, I don't mind. It's actually a matter of configuring the
>> superclass to center or not.
>>
>> But it's also easy to center the data in the M/R. I agree it makes
>> little difference in your case, and the effect is subtle. I can add
>> centering, to at least have it in the code for consistency -- in
>> addition to adding the implementation above for completeness.
>>
>> While I'm at it, I think we might be able to simplify this item-item
>> computation. A straightforward alternative is something like this:
>>
>> 1. Compute item vectors (using Vector output, ideally) in one M/R
>> 2M. For each item-item pair, output both vectors
>> 2R. With both Vectors in hand, easily compute the cosine measure
>>
>> It's quite simple, and lends itself to dropping in a different reducer
>> stage to implement different similarities, which is great.
>>
>> The question is performance. Say I have P prefs, U users and I items.
>> Assume P/I prefs per item. The bottleneck of this stage is outputting
>> I^2 vectors of size P/I, or about P*I output.
>>
>> What you're doing now bottlenecks in the last bit where you output for
>> each user some data for every pair of items. That's coarsely U *
>> (P/U)^2 = U*P^2 output. I think that P^2 is killer.
>>
>> Double-check my thinking, but if you like it I think I can
>> significantly simplify this job. I can maybe make a patch for you to
>> try.
>>
>> On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter
>> <[email protected]> wrote:
>>
>>> However, I'm a little bit confused now; let me explain why I thought
>>> that this additional similarity implementation would be necessary: Three
>>> weeks ago, I submitted a patch computing the cosine item similarities
>>> via map-reduce (MAHOUT-362), which is how I currently fill my database
>>> table of precomputed item-item similarities. Some days ago I needed to
>>> precompute lots of recommendations offline via
>>> org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want
>>> to connect my map-reduce code to the database.
>>> So I was in need of a similarity implementation that would give me the
>>> same results on the fly from a FileDataModel, which is how I came to
>>> create the CosineItemSimilarity implementation. So my point here would
>>> be that if the code in org.apache.mahout.cf.taste.hadoop.similarity.item
>>> does not center the data, a non-distributed implementation of that way
>>> of computing the similarity should be available too, or the code in
>>> org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to
>>> center the data.
>>>
>>> You stated that centering the data "adjusts for a user's tendency to
>>> rate high or low on average", which is certainly necessary when you
>>> collect explicit ratings from your users. However, in my use case (a big
>>> online-shopping platform) I unfortunately do not have explicit ratings
>>> from the users, so I chose to interpret certain actions as ratings (I
>>> recall this is called implicit ratings in the literature), e.g. a user
>>> putting an item into their shopping basket as a rating of 3, or a user
>>> purchasing an item as a rating of 5, as suggested in the "Making
>>> Recommendations" chapter of "Programming Collective Intelligence" by
>>> T. Segaran. As far as I can judge (to be honest, my mathematical
>>> knowledge is kinda limited), there are no different interpretations of
>>> the rating scale here, as the values are fixed, so I thought that
>>> centering the data would not be necessary.
>>>
>>> Regards,
>>> Sebastian
>>>
>>> Sean Owen (JIRA) wrote:
>>>
>>>> [
>>>> https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>>> ]
>>>>
>>>> Sean Owen updated MAHOUT-387:
>>>> -----------------------------
>>>>
>>>>     Status: Resolved  (was: Patch Available)
>>>>     Assignee: Sean Owen
>>>>     Fix Version/s: 0.3
>>>>     Resolution: Won't Fix
>>>>
>>>> Yes, like Jeff said, this actually exists as PearsonCorrelationSimilarity.
>>>> In the case where the mean of each series is 0, the result is the same.
>>>> The fastest way I know to see this is to just look at this form of the
>>>> sample correlation:
>>>> http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png
>>>> ... and note that sum(xi) = sum(yi) = 0 when the means of xi and yi are
>>>> 0. You're left with sum(xi*yi) in the numerator, which is the dot
>>>> product, and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which
>>>> is the product of the vector lengths. This is just the cosine of the
>>>> angle between x and y.
>>>>
>>>> One can argue whether forcing the data to be centered is right. I think
>>>> it's a good thing in all cases. It adjusts for a user's tendency to rate
>>>> high or low on average. It also makes the computation simpler, and more
>>>> consistent with Pearson (well, it makes it identical!). This has a good
>>>> treatment:
>>>> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation
>>>>
>>>> For this reason I'd mark this as won't-fix for the moment; the patch
>>>> is otherwise nice. I'd personally like to hear more about why not to
>>>> center, if there's an argument for it.
>>>>
>>>>
>>>>
>>>>> Cosine item similarity implementation
>>>>> -------------------------------------
>>>>>
>>>>> Key: MAHOUT-387
>>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-387
>>>>> Project: Mahout
>>>>> Issue Type: New Feature
>>>>> Components: Collaborative Filtering
>>>>> Reporter: Sebastian Schelter
>>>>> Assignee: Sean Owen
>>>>> Fix For: 0.3
>>>>>
>>>>> Attachments: MAHOUT-387.patch
>>>>>
>>>>>
>>>>> I needed to compute the cosine similarity between two items when running
>>>>> org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, but I couldn't
>>>>> find an implementation (did I overlook it maybe?), so I created my own.
>>>>> I want to share it here, in case you find it useful.
>>>>>
>>>>>
>>>>
>>>
>
>
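
As a concrete check of the equivalence Sean points out in the JIRA comment above (the cosine of mean-centered vectors is exactly the Pearson correlation), here is a small self-contained sketch. Again plain Java with made-up names, not code from any patch; it computes Pearson in its deviations-from-the-mean form and the cosine of the centered vectors, and both expressions give the same number:

  public final class PearsonVsCenteredCosine {

    static double mean(double[] v) {
      double sum = 0.0;
      for (double x : v) {
        sum += x;
      }
      return sum / v.length;
    }

    // Pearson correlation in the "sum of products of deviations" form.
    static double pearson(double[] x, double[] y) {
      double mx = mean(x);
      double my = mean(y);
      double num = 0.0;
      double dx = 0.0;
      double dy = 0.0;
      for (int i = 0; i < x.length; i++) {
        num += (x[i] - mx) * (y[i] - my);
        dx += (x[i] - mx) * (x[i] - mx);
        dy += (y[i] - my) * (y[i] - my);
      }
      return num / (Math.sqrt(dx) * Math.sqrt(dy));
    }

    // Plain cosine; applied to centered vectors it gives the same value.
    static double cosine(double[] x, double[] y) {
      double dot = 0.0;
      double nx = 0.0;
      double ny = 0.0;
      for (int i = 0; i < x.length; i++) {
        dot += x[i] * y[i];
        nx += x[i] * x[i];
        ny += y[i] * y[i];
      }
      return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    // Subtract the mean of the values from every component.
    static double[] center(double[] v) {
      double m = mean(v);
      double[] centered = new double[v.length];
      for (int i = 0; i < v.length; i++) {
        centered[i] = v[i] - m;
      }
      return centered;
    }

    public static void main(String[] args) {
      double[] a = {3.0, 5.0, 4.0, 1.0};
      double[] b = {2.0, 5.0, 5.0, 1.0};
      System.out.println(pearson(a, b));                // ~0.923
      System.out.println(cosine(center(a), center(b))); // same value, ~0.923
    }
  }

The ratings in main are arbitrary; any two series with non-zero variance will show the same agreement between the two formulas.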
