Sorry, typo, that's what I meant. Yes, the difference isn't *that* large!
It may be worse in practice since you have a few users with very many prefs.
It may also be beneficial to simply have one fewer phase and throw
around less data. I will also try to benchmark since really that's the
only way to know.

On Wed, Apr 28, 2010 at 3:01 PM, Sebastian Schelter
<[email protected]> wrote:
> Hi Sean,
>
> "Say I have P prefs, U users and I items.
> Assume P/I prefs per item. The bottleneck of this stage is outputting
> I2 vectors of size P/I, or about P*I output.
>
> What you're doing now bottlenecks in the last bit where you output for
> each user some data for every pair of items. That's coarsely U *
> (P/U)2 = U*P2 output. I think that P2 is killer."
>
> There's an error in the above math: U * (P/U)^2 = P^2 / U (not U*P^2),
> which makes me guess that there won't be any improvement by mapping out
> all item-item pairs in the beginning.
> I showed the calculations to a mathematician and he was also of the
> opinion that there wouldn't be any improvement. You'll find our
> thoughts below; please check that we did not overlook something. I will
> run the job with your patch in the next few days to get us a real-life
> comparison.
>
> Regards,
> Sebastian
>
> -------------------------------------
> Calculated with some example numbers
>
> U = 1,000,000 users
> I = 50,000 items
> P = 50,000,000 preferences
>
> P*I = 2.5 trillion
> P^2/U = 2.5 billion
>
> -------------------------------------
> Or stated as an inequality:
>
> P^2/U > P*I
> P/U > I
> P > U*I <-- obviously not true
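>
> Just to make the comparison concrete, here is a throwaway sanity check
> (plain Java, not part of any patch) that only plugs in the example
> figures from above:
>
>   public class OutputEstimate {
>     public static void main(String[] args) {
>       // example figures from above
>       long U = 1000000L;   // users
>       long I = 50000L;     // items
>       long P = 50000000L;  // preferences
>       System.out.println("P*I   = " + (P * I));     // 2,500,000,000,000 (2.5 trillion)
>       System.out.println("P^2/U = " + (P * P / U)); // 2,500,000,000 (2.5 billion)
>       // P < U*I (50 million < 50 billion), so P^2/U < P*I with these numbers
>     }
>   }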
>
>
> Sean Owen schrieb:
>> Well it's not hard to add something like UncenteredCosineSimilarity
>> for sure, I don't mind. It's actually a matter of configuring the
>> superclass to center or not.
>>
>> But it's also easy to center the data in the M/R. I agree it makes
>> little difference in your case, and the effect is subtle. I can add
>> centering, to at least have it in the code for consistency -- in
>> addition to adding the implementation above for completeness.
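>>
>> (Centering here just means subtracting each user's mean pref value from
>> that user's prefs before computing similarities -- a trivial sketch in
>> plain Java, outside of any Hadoop code:)
>>
>>   // subtract the mean of the array from every entry, in place
>>   static void center(double[] prefs) {
>>     double sum = 0.0;
>>     for (double p : prefs) {
>>       sum += p;
>>     }
>>     double mean = sum / prefs.length;
>>     for (int i = 0; i < prefs.length; i++) {
>>       prefs[i] -= mean;
>>     }
>>   }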
>>
>> While I'm at it I think we might be able to simplify this item-item
>> computation. A straightforward alternative is something like this:
>>
>> 1. Compute item vectors (using Vector output, ideally) in one M/R
>> 2M. For each item-item pair output both vectors
>> 2R. With both Vectors in hand easily compute the cosine measure
>>
>> It's quite simple, and lends itself to dropping in a different reducer
>> stage to implement different similarities, which is great.
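>>
>> Roughly, the reducer in step 2R then boils down to something like this
>> (a minimal sketch on plain double arrays, ignoring the Hadoop and Mahout
>> Vector plumbing, so the method here is made up for illustration):
>>
>>   // cosine of the angle between the two item vectors of a pair
>>   static double cosine(double[] x, double[] y) {
>>     double dot = 0.0, normX = 0.0, normY = 0.0;
>>     for (int i = 0; i < x.length; i++) {
>>       dot   += x[i] * y[i];
>>       normX += x[i] * x[i];
>>       normY += y[i] * y[i];
>>     }
>>     return dot / (Math.sqrt(normX) * Math.sqrt(normY));
>>   }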
>>
>> The question is performance. Say I have P prefs, U users and I items.
>> Assume P/I prefs per item. The bottleneck of this stage is outputting
>> I^2 vectors of size P/I, or about P*I output.
>>
>> What you're doing now bottlenecks in the last bit where you output for
>> each user some data for every pair of items. That's coarsely U *
>> (P/U)^2 = U*P^2 output. I think that P^2 is killer.
>>
>> Double-check my thinking but if you like it I think I can
>> significantly simplify this job. I can maybe make a patch for you to
>> try.
>>
>> On Wed, Apr 28, 2010 at 9:05 AM, Sebastian Schelter
>> <[email protected]> wrote:
>>
>>> However, I'm a little bit confused now, let me explain why I thought
>>> that this additional similarity implementation would be necessary: Three
>>> weeks ago, I submitted a patch computing the cosine item similarities
>>> via map-reduce (MAHOUT-362), which is how I currently fill my database
>>> table of precomputed item-item similarities. A few days ago I needed to
>>> precompute lots of recommendations offline via
>>> org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob and didn't want
>>> to connect my map-reduce code to the database. So I was in need of a
>>> similarity implementation that would give me the same results on the fly
>>> from a FileDataModel, which is how I came to create the
>>> CosineItemSimilarity implementation. So my point here would be that if
>>> the code in org.apache.mahout.cf.taste.hadoop.similarity.item does not
>>> center the data, a non-distributed implementation of that way of
>>> computing the similarity should be available too, or the code in
>>> org.apache.mahout.cf.taste.hadoop.similarity.item should be changed to
>>> center the data.
>>>
>>> You stated that centering the data "adjusts for a user's tendency to
>>> rate high or low on average" which is certainly necessary when you
>>> collect explicit ratings from your users. However, in my use case (a big
>>> online-shopping platform) I unfortunately do not have explicit ratings
>>> from the users, so I chose to interpret certain actions as ratings (I
>>> recall this is called implicit ratings in the literature), e.g. a user
>>> putting an item into their shopping basket as a rating of 3 or a user
>>> purchasing an item as a rating of 5, as suggested in the "Making
>>> Recommendations" chapter of "Programming Collective Intelligence" by Toby
>>> Segaran. As far as I can judge (to be honest, my mathematical knowledge
>>> is rather limited) there are no different interpretations of the rating
>>> scale here, as the values are fixed, so I thought that centering the
>>> data would not be necessary.
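>>>
>>> For illustration, the mapping I mean is essentially the following (a
>>> made-up sketch, not code from the patch; the action names are
>>> hypothetical):
>>>
>>>   // hypothetical mapping of implicit user actions to fixed rating values
>>>   static float impliedRating(String action) {
>>>     if ("ADD_TO_BASKET".equals(action)) {
>>>       return 3.0f; // item was put into the shopping basket
>>>     }
>>>     if ("PURCHASE".equals(action)) {
>>>       return 5.0f; // item was actually bought
>>>     }
>>>     throw new IllegalArgumentException("unknown action: " + action);
>>>   }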
>>>
>>> Regards,
>>> Sebastian
>>>
>>> Sean Owen (JIRA) schrieb:
>>>
>>>>      [ 
>>>> https://issues.apache.org/jira/browse/MAHOUT-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>>>  ]
>>>>
>>>> Sean Owen updated MAHOUT-387:
>>>> -----------------------------
>>>>
>>>>            Status: Resolved  (was: Patch Available)
>>>>          Assignee: Sean Owen
>>>>     Fix Version/s: 0.3
>>>>        Resolution: Won't Fix
>>>>
>>>> Yes like Jeff said, this actually exists as PearsonCorrelationSimilarity. 
>>>> In the case where the mean of each series is 0, the result is the same. 
>>>> The fastest way I know to see this is to just look at this form of the 
>>>> sample correlation: 
>>>> http://upload.wikimedia.org/math/c/a/6/ca68fbe94060a2591924b380c9bc4e27.png
>>>>  ... and note that sum(xi) = sum(yi) = 0 when the means of xi and yi are 
>>>> 0. You're left with sum(xi*yi) in the numerator, which is the dot product, 
>>>> and sqrt(sum(xi^2)) * sqrt(sum(yi^2)) in the denominator, which are the 
>>>> vector norms. This is just the cosine of the angle between x and y.
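>>>>
>>>> Written out in LaTeX notation (assuming the linked image is the usual
>>>> computational form of the sample correlation), the reduction is:
>>>>
>>>>   r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}
>>>>            {\sqrt{n \sum x_i^2 - (\sum x_i)^2} \, \sqrt{n \sum y_i^2 - (\sum y_i)^2}}
>>>>
>>>> and with \sum x_i = \sum y_i = 0 the factors of n cancel, leaving
>>>>
>>>>   r = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2} \, \sqrt{\sum y_i^2}} = \cos\theta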
>>>>
>>>> One can argue whether forcing the data to be centered is right. I think 
>>>> it's a good thing in all cases. It adjusts for a user's tendency to rate 
>>>> high or low on average. It also makes the computation simpler, and more 
>>>> consistent with Pearson (well, it makes it identical!). This has a good 
>>>> treatment:
>>>> http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Geometric_interpretation
>>>>
>>>> Only for this reason I'd mark this as won't-fix for the moment; the patch 
>>>> is otherwise nice. I'd personally like to hear more about why to not 
>>>> center if there's an argument for it.
>>>>
>>>>
>>>>
>>>>> Cosine item similarity implementation
>>>>> -------------------------------------
>>>>>
>>>>>                 Key: MAHOUT-387
>>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-387
>>>>>             Project: Mahout
>>>>>          Issue Type: New Feature
>>>>>          Components: Collaborative Filtering
>>>>>            Reporter: Sebastian Schelter
>>>>>            Assignee: Sean Owen
>>>>>             Fix For: 0.3
>>>>>
>>>>>         Attachments: MAHOUT-387.patch
>>>>>
>>>>>
>>>>> I needed to compute the cosine similarity between two items when running 
>>>>> org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob, but I couldn't find 
>>>>> an implementation (did I maybe overlook it?), so I created my own. I want 
>>>>> to share it here in case you find it useful.
>>>>>
>>>>>
>>>>
>>>
>
>
