Hi there,
I've invested a lot of work in improving RowSimilarityJob recently and
I'm writing to get some feedback on my intermediate results. I also dug
out an old mailing list discussion about the matrix form of item-based
recommendation that helped me a lot; I recommend it to anyone who wants
to dive into the details here:
http://lucene.472066.n3.nabble.com/Re-Taste-GenericItemBasedRecommender-td641062.html
I have reworked the algorithm along the lines proposed in that mailing
list discussion. The main goal was to drastically reduce the amount of
data that needs to be shipped over the network in the cooccurrence
computation phase. This is accomplished by making heavy use of a
combiner for all supported similarity measures, by no longer attaching
the "norms" of the rows to the cooccurrence pairs, and by emitting
cooccurrence "stripes" instead of individual cooccurrence pairs.
A similarity measure must now be expressed by four simple functions
that are invoked in different stages of the algorithm:

normalize(row) (invoked once at the beginning, can transform the input
row vectors)
norm(row) (invoked once at the beginning, maps an input vector to a
double)
aggregate(nonZeroValueInDimensionXFromRowA, nonZeroValueInDimensionXFromRowB)
(invoked for each pair of cooccurring elements of two input vectors,
maps them to a double; these doubles are then summed)
similarity(summedAggregations, normOfRowA, normOfRowB, numColumns)
(invoked at the end with the results of the other functions, computes
the final similarity)
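Expressed in code, the contract could look roughly like the following
interface (a sketch with names of my own choosing, not necessarily the
exact signatures in the patch):

import org.apache.mahout.math.Vector;

interface VectorSimilarityMeasure {

  /* invoked once per input row, may transform it */
  Vector normalize(Vector row);

  /* invoked once per input row, condenses it to a single double */
  double norm(Vector row);

  /* invoked per pair of cooccurring non-zero elements, results are summed */
  double aggregate(double nonZeroValueA, double nonZeroValueB);

  /* invoked once per row pair with the results of the other functions */
  double similarity(double summedAggregations, double normOfRowA,
      double normOfRowB, int numColumns);
}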
A short list of how our commonly used measures can be expressed this
way (a code sketch for two of them follows after the list):
cosine:
normalize(...) normalizes the vectors to unit length
aggregate(...) multiplies valueA and valueB
similarity(...) just needs to return summedAggregations (which is the
dot product)
pearson:
same as cosine, but normalize(...) centers the vector elements first
(without touching 0s as this would make the input matrix dense)
cooccurrence-count / overlap:
aggregate(...) returns 1 as we only need to count
similarity(...) returns summedAggregations only
tanimoto/jaccard:
norm(...) returns the L0 norm (number of non-zero elements)
aggregate(...) returns 1 as we only need to count
similarity(...) returns summedAggregations / (normOfRowA + normOfRowB -
summedAggregations)
loglikelihood:
same as tanimoto, but we compute the LLR in similarity(...) from
summedAggregations, normOfRowA, normOfRowB and numColumns
cityblock/manhattan:
same as tanimoto, but similarity(...) returns 1.0 / (1.0 + normOfRowA +
normOfRowB - 2 * summedAggregations)
euclidean distance:
norm(...) returns the summed squares of the input vector
aggregate(...) multiplies valueA and valueB
similarity(...) returns 1.0 / (1.0 + Math.sqrt(normOfRowA - 2 *
summedAggregations + normOfRowB))
Unfortunately, we can't use the formula from
o.a.m.cf.taste.impl.similarity.EuclideanDistanceSimilarity here, as it
also incorporates the number of "overlapping" dimensions.
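To make this concrete, here is how cosine and tanimoto could look
against the sketched interface (again my hypothetical names, not the
exact code from the patch):

import org.apache.mahout.math.Vector;

/* Cosine: with unit-length rows, the summed products are the dot product. */
class CosineSimilarity implements VectorSimilarityMeasure {
  public Vector normalize(Vector row) {
    return row.normalize();    // scale to unit length
  }
  public double norm(Vector row) {
    return 1.0;                // unused, rows already have unit length
  }
  public double aggregate(double nonZeroValueA, double nonZeroValueB) {
    return nonZeroValueA * nonZeroValueB;
  }
  public double similarity(double summedAggregations, double normOfRowA,
      double normOfRowB, int numColumns) {
    return summedAggregations; // the dot product of the normalized rows
  }
}

/* Tanimoto/Jaccard: count cooccurrences, norms count non-zero elements. */
class TanimotoSimilarity implements VectorSimilarityMeasure {
  public Vector normalize(Vector row) {
    return row;
  }
  public double norm(Vector row) {
    return row.getNumNondefaultElements(); // the L0 norm
  }
  public double aggregate(double nonZeroValueA, double nonZeroValueB) {
    return 1.0;                            // we only need to count
  }
  public double similarity(double summedAggregations, double normOfRowA,
      double normOfRowB, int numColumns) {
    return summedAggregations / (normOfRowA + normOfRowB - summedAggregations);
  }
}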
I'm currently running tests on a small cluster using several rating
datasets (18M music ratings from Last.fm, 100M movie ratings from
Netflix, 250M music ratings from Yahoo!'s KDD Cup). The results look
reasonable so far, but I'm still searching for ways to improve the
algorithm.
One thing I'm currently looking into is how to sample the input. Ted
has stated that you usually only need to look at a few hundred or
thousand ratings per item, as you don't learn anything new from the
rest. Would it be sufficient to randomly sample the ratings of an item
then? That's what I'm currently doing (along the lines of the sketch
below), but I wonder whether there are more clever ways to do this.
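For reference, the sampling I'm currently doing looks roughly like this
(a sketch with a hypothetical helper name; maxEntries would be the few
hundred or thousand ratings Ted mentioned):

import java.util.Iterator;
import java.util.Random;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

/* Caps a row at maxEntries non-zero elements via reservoir sampling,
 * so that every rating has the same chance of surviving. */
class RatingSampler {

  static Vector sample(Vector row, int maxEntries, Random random) {
    if (row.getNumNondefaultElements() <= maxEntries) {
      return row;
    }
    // copy indexes and values, as the iterator may reuse its element object
    int[] indexes = new int[maxEntries];
    double[] values = new double[maxEntries];
    int seen = 0;
    Iterator<Vector.Element> elements = row.iterateNonZero();
    while (elements.hasNext()) {
      Vector.Element element = elements.next();
      if (seen < maxEntries) {
        indexes[seen] = element.index();
        values[seen] = element.get();
      } else {
        int position = random.nextInt(seen + 1);
        if (position < maxEntries) {
          indexes[position] = element.index();
          values[position] = element.get();
        }
      }
      seen++;
    }
    Vector sampled = new RandomAccessSparseVector(row.size(), maxEntries);
    for (int i = 0; i < maxEntries; i++) {
      sampled.setQuick(indexes[i], values[i]);
    }
    return sampled;
  }
}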
There is a sampling heuristic in the distributed recommender that
caught my attention a long time ago; can anyone give me some details on
it? I basically understand what it does, but not exactly why it's a
good way to sample. Here's the original implementation:
http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/UserVectorToCooccurrenceMapper.java?revision=950049&view=co&pathrev=96188
Another issue I'm looking into is whether it would help to specify a
similarity threshold. There is some work on "All Pairs Similarity
Search", which is closely related to cooccurrence computation and seems
to have some nice pruning strategies based on a threshold.
For example, imagine your similarity measure is the cooccurrence count
and you only want to keep pairs with a count of at least 5; then you
would not need to consider any pair where one of the vectors has fewer
than 5 non-zero entries.
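For the cooccurrence count case, that pruning rule boils down to a
one-liner (a sketch, with threshold being the minimum count we care
about):

import org.apache.mahout.math.Vector;

/* A row with fewer than threshold non-zero entries cannot take part in
 * any pair whose cooccurrence count reaches the threshold, so it can be
 * skipped before the cooccurrence computation. */
class CooccurrencePruning {
  static boolean canBeSkipped(Vector row, int threshold) {
    return row.getNumNondefaultElements() < threshold;
  }
}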
Something like this could be a way to further improve the algorithm's
running time, especially as the resulting similarity values seem to
follow a heavy-tailed distribution. It should be possible to find
pruning heuristics for all measures, but I don't yet have an idea of
how to choose a good a priori threshold.
I'm thinking about writing a small paper about my work on this (I still
need to see whether I can "sell" it to my professor :) ). I hope
everyone who contributed to the discussion here is ok with that, as I
don't want to "steal" ideas.
--sebastian
On 20.07.2011 17:40, Ted Dunning wrote:
Not if you are doing Euclidean distance.
On Wed, Jul 20, 2011 at 5:41 AM, Grant Ingersoll (JIRA) <[email protected]> wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068331#comment-13068331]
Grant Ingersoll commented on MAHOUT-767:
----------------------------------------
Why do we need the norms? Why can't we assume the user has already
normalized?
Improve RowSimilarityJob performance for count-based distance measures
----------------------------------------------------------------------
Key: MAHOUT-767
URL: https://issues.apache.org/jira/browse/MAHOUT-767
Project: Mahout
Issue Type: Improvement
Reporter: Grant Ingersoll
Fix For: 0.6
(See
http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7
for background)
Currently, the RowSimilarityJob defers the calculation of the similarity
metric until the reduce phase, while emitting many Cooccurrence objects.
For similarity metrics that are algebraic (
http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should
be able to do much of the computation during the Mapper part of this phase
and also take advantage of a Combiner.
We should use a marker interface to know whether a similarity metric is
algebraic and then make use of an appropriate Mapper implementation,
otherwise we can fall back on our existing implementation.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira