On Mon, Apr 26, 2010 at 2:02 PM, Mattias Hilliges <hilli...@neofonie.de> wrote:
> Hi,
> I noticed the following behaviour, which seems a bit strange to me:
> Let v=(v1, v2, ..., vn) and w=(w1, w2, ..., wm) be vectors that are used to
> compute the similarity between two items/users. If all vi that overlap
> with w (meaning vi != 0 and wi != 0) are equal, and if all wj that
> overlap with v are equal, no Euclidean or Pearson similarity can be
> computed.
The Pearson correlation is undefined on two series if either one has all the same values. This is because the standard deviation of that series is 0, and the correlation computation involves scaling by (dividing by) the standard deviations.

For Euclidean, the distance is normalized by the sum of the sizes (norms) of the preference vectors. In this case, both those sizes are 0, since the data is centered (mean 0) and these equal values both map to (0,0). It's quite a corner case. This normalization step is a bit questionable, and old, and could be removed. The idea is to not let one user's scale of preference values affect the result -- whether I rate on a 1-to-5 or a 10-to-50 scale. This is for consistency with Pearson's behavior, but I think you could easily argue it's not necessary, or even desirable, to emulate this property.

> The problem is that "double computeResult(int n, double sumXY, double
> sumX2, double sumY2, double sumXYdiff2)" in the corresponding subclass
> of AbstractSimilarity is called with parameters sumXY=sumX2=sumY2=0 and
> therefore returns Double.NaN. This behaviour contradicts the behaviour
> described in the book "Mahout in Action", p. 49. The last complete
> sentence there is: "Note that we were able to compute some notion of
> similarity for all pairs of users here, whereas the Pearson correlation
> couldn't produce an answer for users 1 and 3." Because of the described
> problem, the Euclidean algorithm can't produce an answer either. This is
> a special case of the described problem, where there is only one overlap.

True -- well, the book is presenting a simplified version of the Euclidean similarity, without anything else that happens in the real code, like centering or normalizing for dimension. The book is correct about the simplified version, but its point would not hold for the actual implementation as it stands now. I don't think that really harms the point -- that funny things happen with sparse data -- but it's not ideal.

And, given that the cause is a normalization which can arguably be removed, I'd be fine removing that normalization (unless someone stops me for a good reason). Then it would all be consistent.

Sean
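P.S. For anyone who wants to see where the NaN actually comes from, here is a rough, self-contained sketch. It is not the real Mahout code -- the class name and the exact Euclidean formula are made up for illustration, following only the description above -- but it shows how both metrics degenerate to 0.0 / 0.0 once the centered overlapping values are all zero:

// Rough illustration only -- not the actual Mahout implementation.
// Shows why both metrics produce NaN once the centered overlap is all zeros.
public class NaNIllustration {

  // Pearson-style result: the sums are over the centered overlapping values.
  static double pearson(double sumXY, double sumX2, double sumY2) {
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    return sumXY / denominator;  // 0.0 / 0.0 == NaN when either series is constant
  }

  // Euclidean-style result, with the distance normalized by the sum of the
  // sizes (norms) of the two centered preference vectors, as described above.
  // The exact formula here is hypothetical; only the 0/0 behaviour matters.
  static double euclidean(double sumXYdiff2, double sumX2, double sumY2) {
    double distance = Math.sqrt(sumXYdiff2);
    double norm = Math.sqrt(sumX2) + Math.sqrt(sumY2);
    return 1.0 / (1.0 + distance / norm);  // distance / norm is 0.0 / 0.0 == NaN here
  }

  public static void main(String[] args) {
    // With a single overlapping item, or several equal ones, centering (mean 0)
    // turns every overlapping value into 0, so all the sums are 0.
    double sumXY = 0.0, sumX2 = 0.0, sumY2 = 0.0, sumXYdiff2 = 0.0;
    System.out.println(pearson(sumXY, sumX2, sumY2));         // NaN
    System.out.println(euclidean(sumXYdiff2, sumX2, sumY2));  // NaN
  }
}

The simplified version in the book skips the centering and the normalization, which is why it still gets an answer in this case.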