[
https://issues.apache.org/jira/browse/MAHOUT-389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Schelter updated MAHOUT-389:
--------------------------------------
Attachment: MAHOUT-389-2.patch
There might be cases where it makes sense to look at more than just the
co-ratings. For example, imagine you have three products: A, B and C.
Let's say the pairs A,B and A,C have the same co-ratings (the same users bought
both items of each pair), but B is a top seller that is bought by lots of people
and C is a niche product that sells only rarely.
A cosine that includes the zero assumption (a missing rating counts as 0) would
decrease the value for the top seller and prefer the niche product, which might
be a good thing depending on your use case.
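To make the A/B/C example concrete, here is a small standalone sketch of both
cosine variants. It is not code from the patch: the class name, the method
names and the purchase data are made up for illustration, and every purchase is
assumed to be a rating of 1.0.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineVariants {

  /** Cosine restricted to users who rated both items (dot product and norms). */
  static double coRatedCosine(Map<Long, Double> x, Map<Long, Double> y) {
    double dot = 0, normX = 0, normY = 0;
    for (Map.Entry<Long, Double> e : x.entrySet()) {
      Double other = y.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other;
        normX += e.getValue() * e.getValue();
        normY += other * other;
      }
    }
    return (normX == 0 || normY == 0) ? 0.0 : dot / Math.sqrt(normX * normY);
  }

  /** Cosine with the zero assumption: norms run over all ratings of each item. */
  static double zeroAssumingCosine(Map<Long, Double> x, Map<Long, Double> y) {
    double dot = 0;
    for (Map.Entry<Long, Double> e : x.entrySet()) {
      Double other = y.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other; // users missing on one side contribute 0
      }
    }
    double normX = sumOfSquares(x), normY = sumOfSquares(y);
    return (normX == 0 || normY == 0) ? 0.0 : dot / Math.sqrt(normX * normY);
  }

  static double sumOfSquares(Map<Long, Double> v) {
    double sum = 0;
    for (double value : v.values()) {
      sum += value * value;
    }
    return sum;
  }

  /** Hypothetical helper: an item vector where each listed user has rating 1.0. */
  static Map<Long, Double> boughtBy(long... users) {
    Map<Long, Double> v = new HashMap<>();
    for (long u : users) {
      v.put(u, 1.0);
    }
    return v;
  }

  public static void main(String[] args) {
    Map<Long, Double> a = boughtBy(1, 2, 3);
    Map<Long, Double> b = boughtBy(1, 2, 4, 5, 6, 7, 8, 9); // top seller
    Map<Long, Double> c = boughtBy(1, 2);                   // niche product
    System.out.println("co-rated      A,B: " + coRatedCosine(a, b));
    System.out.println("co-rated      A,C: " + coRatedCosine(a, c));
    System.out.println("zero-assuming A,B: " + zeroAssumingCosine(a, b));
    System.out.println("zero-assuming A,C: " + zeroAssumingCosine(a, c));
  }
}
```

With this data the co-rated variant cannot tell the two pairs apart (both come
out as 1.0), while the zero-assuming variant gives 2/sqrt(24) for A,B and
2/sqrt(6) for A,C, so the niche product C ranks above the top seller B.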
But I definitely see your point that the assumption does not hold in general,
and I also think that the distributed version should be modified.
I attached a patch with a first proposal for how this could be handled.
I tried to refactor the similarity computation out of the map-reduce code and
make it possible to implement different similarity functions that have to
follow this scheme:
* in an early stage of the process, the similarity implementation can compute a
weight (a single double) for each item-vector
* in the end, it is given all co-ratings and the previously computed weights
for each item-pair that has at least one co-rating
That should be sufficient to compute the centered Pearson correlation as well as
cosine or Tanimoto coefficients.
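The two-step scheme could be sketched roughly like this. Note that this is not
the interface from the attached patch: TwoPhaseSimilarity, CoRating and the two
implementations are hypothetical names chosen for illustration, and the sketch
runs on plain in-memory data instead of map-reduce.

```java
import java.util.Arrays;
import java.util.List;

public class TwoPhaseSimilaritySketch {

  /** The two preference values one user gave to both items of a pair. */
  static final class CoRating {
    final double prefX;
    final double prefY;

    CoRating(double prefX, double prefY) {
      this.prefX = prefX;
      this.prefY = prefY;
    }
  }

  /** Hypothetical interface for the scheme, not the one from the patch. */
  interface TwoPhaseSimilarity {
    /** Early stage: reduce all preferences of one item to a single double. */
    double weight(double[] itemPrefs);

    /** Final stage: called for each item-pair with at least one co-rating. */
    double similarity(List<CoRating> coRatings, double weightX, double weightY);
  }

  /** Cosine with the zero assumption: the weight is the norm of the full vector. */
  static final class ZeroAssumingCosine implements TwoPhaseSimilarity {
    public double weight(double[] itemPrefs) {
      double sumOfSquares = 0;
      for (double p : itemPrefs) {
        sumOfSquares += p * p;
      }
      return Math.sqrt(sumOfSquares);
    }

    public double similarity(List<CoRating> coRatings, double weightX, double weightY) {
      double dot = 0;
      for (CoRating c : coRatings) {
        dot += c.prefX * c.prefY; // only co-rated users have a non-zero product
      }
      return dot / (weightX * weightY); // assumes both items have a non-zero norm
    }
  }

  /** Tanimoto coefficient: the weight is the number of preferences of the item. */
  static final class Tanimoto implements TwoPhaseSimilarity {
    public double weight(double[] itemPrefs) {
      return itemPrefs.length;
    }

    public double similarity(List<CoRating> coRatings, double weightX, double weightY) {
      double both = coRatings.size();
      return both / (weightX + weightY - both); // intersection / union
    }
  }

  public static void main(String[] args) {
    double[] itemX = {1, 1, 1};                // rated by 3 users
    double[] itemY = {1, 1, 1, 1, 1, 1, 1, 1}; // rated by 8 users
    // Two of those users rated both items.
    List<CoRating> coRatings = Arrays.asList(new CoRating(1, 1), new CoRating(1, 1));

    for (TwoPhaseSimilarity s : Arrays.asList(new ZeroAssumingCosine(), new Tanimoto())) {
      double sim = s.similarity(coRatings, s.weight(itemX), s.weight(itemY));
      System.out.println(s.getClass().getSimpleName() + ": " + sim);
    }
  }
}
```

The point of the split is that the final stage never needs the full item
vectors, only the co-ratings and one double per item, which is what makes it
possible to plug different similarity functions into the same job.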
I hope it's understandable what I'm trying to propose here. Taking a look at
org.apache.mahout.cf.taste.hadoop.similarity.DistributedSimilarity together
with DistributedPearsonCorrelationSimilarity and
DistributedUncenteredZeroAssumingCosineSimilarity will hopefully help to get a
clearer picture. These implementations are merely for demonstration purposes;
they could be merged with the already existing non-distributed implementations
in case you like the approach described here.
> UncenteredCosineSimilarity
> ---------------------------
>
> Key: MAHOUT-389
> URL: https://issues.apache.org/jira/browse/MAHOUT-389
> Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
> Reporter: Sebastian Schelter
> Priority: Minor
> Attachments: MAHOUT-389-2.patch, MAHOUT-389.patch
>
>
> org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity only
> computes the cosine distance between those components of the vectors where
> both vectors have a value greater than zero.
> This is inconsistent with the definition of the cosine (correct me if I'm
> wrong) and is inconsistent with the distributed cosine similarity computation.