On Mon, Apr 26, 2010 at 2:02 PM, Mattias Hilliges <hilli...@neofonie.de> wrote:
> Hi,
> I noticed the following behaviour, which seems a bit strange to me:
> Let v=(v1, v2, ..., vn) and w=(w1, w2, ..., wm) be vectors that are used to
> compute the similarity between two items/users. If all vi that overlap
> with w (meaning vi != 0 and wi != 0) are equal, and if all wj that
> overlap with v are equal, no Euclidean or Pearson similarity can be
> computed.
The Pearson correlation is undefined on two series if either one has all the same values. This is because the standard deviation of that series is 0, and the correlation computation involves scaling by (dividing by) the standard deviations.

For Euclidean, the distance is normalized by the sum of the sizes (norms) of the preference vectors. In this case, both those sizes are 0, since the data is centered (mean 0) and these equal values both map to (0,0). It's quite a corner case. This normalization step is a bit questionable, and old, and could be removed. The idea is to not let one user's scale of preference values affect the result -- whether I rate on a 1-to-5 or a 10-to-50 scale. This is for consistency with Pearson's behavior, but I think you could easily argue it's not necessary, or even desirable, to emulate this property.

> The problem is that "double computeResult(int n, double sumXY, double
> sumX2, double sumY2, double sumXYdiff2)" in the corresponding subclass
> of AbstractSimilarity is called with parameters sumXY=sumX2=sumY2=0 and
> therefore returns Double.NaN. This behaviour contradicts the behaviour
> described in the book "Mahout in Action", p. 49. The last complete
> sentence there is: "Note that we were able to compute some notion of
> similarity for all pairs of users here, whereas the Pearson correlation
> couldn't produce an answer for users 1 and 3." Because of the described
> problem, the Euclidean algorithm can't produce an answer either. This is
> a special case of the described problem, where there is only one overlap.

True -- well, the book is presenting a simplified version of the Euclidean similarity, without anything else that happens in the real code, like centering or normalizing for dimension. The book is correct about the simplified version, but its point would not hold for the actual implementation as it stands now. I don't think that really harms the point -- that funny things happen with sparse data -- but it's not ideal.

And, given that the cause is a normalization which can arguably be removed, I'd be fine removing that normalization (unless someone stops me for a good reason). Then it would all be consistent.

Sean
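P.S. For anyone who wants to see where the NaN actually comes from, here is a rough, self-contained sketch. It is not the real Mahout code -- the class name and the exact Euclidean formula are made up for illustration, following only the description above -- but it shows how both metrics degenerate to 0.0 / 0.0 once the centered overlapping values are all zero:

// Rough illustration only -- not the actual Mahout implementation.
// Shows why both metrics produce NaN once the centered overlap is all zeros.
public class NaNIllustration {

  // Pearson-style result: the sums are over the centered overlapping values.
  static double pearson(double sumXY, double sumX2, double sumY2) {
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    return sumXY / denominator;  // 0.0 / 0.0 == NaN when either series is constant
  }

  // Euclidean-style result, with the distance normalized by the sum of the
  // sizes (norms) of the two centered preference vectors, as described above.
  // The exact formula here is hypothetical; only the 0/0 behaviour matters.
  static double euclidean(double sumXYdiff2, double sumX2, double sumY2) {
    double distance = Math.sqrt(sumXYdiff2);
    double norm = Math.sqrt(sumX2) + Math.sqrt(sumY2);
    return 1.0 / (1.0 + distance / norm);  // distance / norm is 0.0 / 0.0 == NaN here
  }

  public static void main(String[] args) {
    // With a single overlapping item, or several equal ones, centering (mean 0)
    // turns every overlapping value into 0, so all the sums are 0.
    double sumXY = 0.0, sumX2 = 0.0, sumY2 = 0.0, sumXYdiff2 = 0.0;
    System.out.println(pearson(sumXY, sumX2, sumY2));         // NaN
    System.out.println(euclidean(sumXYdiff2, sumX2, sumY2));  // NaN
  }
}

The simplified version in the book skips the centering and the normalization, which is why it still gets an answer in this case.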