Hi there,

I invested a lot of work in improving RowSimilarityJob recently and I'm writing to get some feedback on my intermediate results. I also dug out an old mailing-list discussion about the matrix form of item-based recommendation that helped me a lot; I recommend it to anyone who wants to dive into the details here:

http://lucene.472066.n3.nabble.com/Re-Taste-GenericItemBasedRecommender-td641062.html

I managed to rework the algorithm as proposed in that mailing-list discussion. The main goal was to drastically reduce the amount of data that needs to be shipped over the network in the cooccurrence computation phase. This is accomplished by making heavy use of a combiner for all supported similarity measures, by no longer attaching the "norms" of the rows to the cooccurrence pairs, and by emitting cooccurrence "stripes" instead of cooccurrence pairs.

A similarity measure must now be expressed by four simple functions that are invoked in different stages of the algorithm:

normalize(row) (invoked once at the beginning, can transform the input row vectors)

norm(row) (invoked once at the beginning, maps an input vector to a double)

aggregate(nonZeroValueInDimensionXFromRowA, nonZeroValueInDimensionXFromRowB) (invoked for each pair of cooccurring elements of two input vectors, can map them to a double; these doubles are then summed)

similarity(summedAggregations, normOfRowA, normOfRowB, numColumns) (invoked at the end with the results of the other functions, should compute the final similarity)
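The contract above could be sketched roughly like this (a hypothetical illustration only: the names follow this mail rather than any final API, and rows are plain double[] arrays instead of Mahout's sparse Vector):

```java
// Hypothetical sketch of the four-function contract; not the actual job code.
class RowSimilaritySketch {

  interface SimilarityMeasure {
    double[] normalize(double[] row);  // invoked once per row, may transform it
    double norm(double[] row);         // invoked once per row, maps it to a double
    double aggregate(double nonZeroValueA, double nonZeroValueB); // per cooccurring pair
    double similarity(double summedAggregations, double normA,
                      double normB, int numColumns);              // final step
  }

  // Drives the four functions for a single pair of rows. The real job does this
  // in a distributed fashion; because the aggregate(...) results are simply
  // summed, partial sums can be computed map-side by a combiner.
  static double pairwiseSimilarity(SimilarityMeasure measure,
                                   double[] rowA, double[] rowB, int numColumns) {
    double[] a = measure.normalize(rowA);
    double[] b = measure.normalize(rowB);
    double normA = measure.norm(a);
    double normB = measure.norm(b);
    double summedAggregations = 0;
    for (int i = 0; i < numColumns; i++) {
      if (a[i] != 0 && b[i] != 0) { // only cooccurring non-zero elements
        summedAggregations += measure.aggregate(a[i], b[i]);
      }
    }
    return measure.similarity(summedAggregations, normA, normB, numColumns);
  }
}
```

A concrete measure then only has to fill in the four methods, and the summation in the middle is what makes the combiner applicable.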

A short list of how our commonly used measures can be expressed this way:

cosine:

normalize(...) normalizes the vectors to unit length
aggregate(...) multiplies valueA and valueB
similarity(...) just needs to return summedAggregations (which is the dot product)
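As a sanity check, cosine in isolation could look like this (illustrative standalone code, not the distributed job):

```java
// Illustrative standalone cosine via the steps above:
// normalize to unit length, multiply cooccurring values, sum.
class CosineSketch {

  // normalize(...): scale the row to unit length (assumes a non-zero row)
  static double[] toUnitLength(double[] row) {
    double sumSquares = 0;
    for (double v : row) sumSquares += v * v;
    double length = Math.sqrt(sumSquares);
    double[] out = new double[row.length];
    for (int i = 0; i < row.length; i++) out[i] = row[i] / length;
    return out;
  }

  static double cosine(double[] rowA, double[] rowB) {
    double[] a = toUnitLength(rowA);
    double[] b = toUnitLength(rowB);
    double summedAggregations = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 && b[i] != 0) {
        summedAggregations += a[i] * b[i]; // aggregate(...) multiplies the values
      }
    }
    return summedAggregations; // similarity(...) is just the dot product
  }
}
```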

pearson:

same as cosine, but normalize(...) centers the non-zero vector elements first (without touching 0s, as that would make the input matrix dense)
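The centering step could be sketched like this (an illustrative helper; only non-zero entries are shifted so sparsity is preserved):

```java
// Hypothetical sketch of the pearson-style normalize(...): subtract the mean
// of the non-zero elements from each non-zero element, leave zeros alone.
class PearsonNormalizeSketch {

  static double[] centerNonZeros(double[] row) {
    double sum = 0;
    int count = 0;
    for (double v : row) {
      if (v != 0) { sum += v; count++; }
    }
    double mean = count == 0 ? 0 : sum / count;
    double[] out = new double[row.length];
    for (int i = 0; i < row.length; i++) {
      out[i] = row[i] == 0 ? 0 : row[i] - mean; // zeros stay zeros
    }
    return out;
  }
}
```

(After this step the vectors would be normalized to unit length exactly as for cosine.)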

cooccurrence-count / overlap:

aggregate(...) returns 1 as we only need to count
similarity(...) returns summedAggregations only

tanimoto/jaccard:

norm(...) returns the l0 norm (number of non-zero elements)
aggregate(...) returns 1 as we only need to count
similarity(...) returns summedAggregations / (normOfRowA + normOfRowB - summedAggregations)
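In isolation, tanimoto then boils down to the following (illustrative sketch):

```java
// Illustrative standalone tanimoto/jaccard: the norm is the number of
// non-zero elements, each cooccurrence contributes 1 to the sum.
class TanimotoSketch {

  // norm(...): l0 norm, i.e. the number of non-zero elements
  static double l0Norm(double[] row) {
    int count = 0;
    for (double v : row) {
      if (v != 0) count++;
    }
    return count;
  }

  static double tanimoto(double[] a, double[] b) {
    double normA = l0Norm(a);
    double normB = l0Norm(b);
    double summedAggregations = 0; // counts the cooccurring dimensions
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 && b[i] != 0) {
        summedAggregations += 1; // aggregate(...) returns 1
      }
    }
    // |A intersect B| / |A union B|
    return summedAggregations / (normA + normB - summedAggregations);
  }
}
```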

loglikelihood:

same as tanimoto, but we compute LLR in similarity(...) from summedAggregations, normOfRowA, normOfRowB and numColumns
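Those four inputs are exactly enough to rebuild the 2x2 contingency table that LLR needs; a sketch of that mapping, following the usual entropy-based formulation (along the lines of Mahout's LogLikelihood utilities):

```java
// Illustrative sketch: derive the 2x2 contingency table from the four
// algorithm outputs and compute the log-likelihood ratio from it.
class LlrSketch {

  static double xLogX(double x) {
    return x == 0 ? 0 : x * Math.log(x);
  }

  static double entropy(double... counts) {
    double sum = 0;
    double result = 0;
    for (double c : counts) {
      result += xLogX(c);
      sum += c;
    }
    return xLogX(sum) - result;
  }

  static double logLikelihoodRatio(double summedAggregations, double normA,
                                   double normB, double numColumns) {
    double k11 = summedAggregations;                              // in both rows
    double k12 = normB - summedAggregations;                      // only in row B
    double k21 = normA - summedAggregations;                      // only in row A
    double k22 = numColumns - normA - normB + summedAggregations; // in neither
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }
}
```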


cityblock/manhattan:

same as tanimoto, but similarity(...) returns 1.0 / (1.0 + normA + normB - 2 * summedAggregations)

euclidean distance:

norm(...) returns the summed squares of the input vector
aggregate(...) multiplies valueA and valueB
similarity(...) returns 1.0 / (1.0 + Math.sqrt(normA - 2 * summedAggregations + normB))
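A possible standalone realization (illustrative only; it uses the identity ||a - b||^2 = normA - 2 * a.b + normB, and maps a distance of zero to a similarity of one):

```java
// Illustrative standalone euclidean-distance-based similarity.
class EuclideanSketch {

  // norm(...): the summed squares of the row
  static double sumOfSquares(double[] row) {
    double s = 0;
    for (double v : row) s += v * v;
    return s;
  }

  static double euclideanSimilarity(double[] a, double[] b) {
    double normA = sumOfSquares(a);
    double normB = sumOfSquares(b);
    double summedAggregations = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 && b[i] != 0) {
        summedAggregations += a[i] * b[i]; // aggregate(...) multiplies the values
      }
    }
    // ||a - b||^2 = normA - 2 * dot(a, b) + normB
    double distance = Math.sqrt(normA - 2 * summedAggregations + normB);
    return 1.0 / (1.0 + distance);
  }
}
```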

Unfortunately, we can't use the formula from o.a.m.cf.taste.impl.similarity.EuclideanDistanceSimilarity here, which also incorporates the number of "overlapping" dimensions.


I'm currently running tests on a small cluster using several rating datasets (18M music ratings from Last.fm, 100M movie ratings from Netflix, 250M music ratings from Yahoo's KDD Cup); results are pretty OK so far, but I'm still searching for ways to improve the algorithm.

One thing I'm currently looking into is how to sample the input. Ted has stated that you usually only need to look at a few hundred or thousand ratings per item, as you don't learn anything new from the rest. Would it be sufficient to randomly sample the ratings of an item then? That's what I'm currently doing, but I wonder whether there are more clever ways to do this.
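One standard way to take a uniform random sample of at most N ratings per item in a single pass is reservoir sampling; a sketch of what I mean (my own illustration, not existing Mahout code):

```java
import java.util.Random;

// Illustrative reservoir sampling: keeps a uniform random subset of at most
// maxSamples entries from an item's rating vector in one pass.
class RatingSampler {

  static double[] sample(double[] ratings, int maxSamples, Random random) {
    if (ratings.length <= maxSamples) {
      return ratings.clone(); // nothing to prune
    }
    double[] reservoir = new double[maxSamples];
    System.arraycopy(ratings, 0, reservoir, 0, maxSamples);
    for (int i = maxSamples; i < ratings.length; i++) {
      // element i replaces a reservoir slot with probability maxSamples / (i + 1)
      int j = random.nextInt(i + 1);
      if (j < maxSamples) {
        reservoir[j] = ratings[i];
      }
    }
    return reservoir;
  }
}
```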

There is a sampling heuristic in the distributed recommender that caught my attention a long time ago; can anyone give details on it? I basically understand what it's doing, but not exactly why it's a good way to sample. Here's the original implementation:

http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/UserVectorToCooccurrenceMapper.java?revision=950049&view=co&pathrev=96188

Another issue I'm looking into is whether it would help to specify a similarity threshold. There is some work on "All Pairs Similarity Search" which is closely related to cooccurrence computation that seems to have some nice pruning strategies based on a threshold.

For example, imagine your similarity measure is cooccurrence count and you only want pairs with a count of at least 5; then you would not need to consider any pair where one of the vectors has fewer than 5 non-zero entries.
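For the cooccurrence-count case, that pruning test is trivial to state in code (a hedged sketch of the idea, not a proposal for the actual job):

```java
// Illustrative pruning check for the cooccurrence-count measure: a row with
// fewer than `threshold` non-zero entries can never reach a cooccurrence
// count of `threshold` with any other row, so it can be skipped entirely.
class PruningSketch {

  static boolean canReachThreshold(double[] row, int threshold) {
    int nonZeros = 0;
    for (double v : row) {
      if (v != 0 && ++nonZeros >= threshold) {
        return true; // enough non-zero entries, keep the row
      }
    }
    return false; // safe to prune this row
  }
}
```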

Something like this could be a way to further improve the algorithm's running time, especially as the resulting similarity values seem to follow a heavy-tailed distribution. It should be possible to find pruning heuristics for all measures, but I don't yet have an idea of how to find a good a-priori threshold.

I'm thinking about writing a small paper about my work on this (I still need to see if I can "sell" it to my professor :) ). I hope everyone who contributed to the discussion here is OK with that; I don't want to "steal" ideas.


--sebastian

On 20.07.2011 17:40, Ted Dunning wrote:
Not if you are doing Euclidean distance.

On Wed, Jul 20, 2011 at 5:41 AM, Grant Ingersoll (JIRA) <[email protected]> wrote:


    [
https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068331#comment-13068331]

Grant Ingersoll commented on MAHOUT-767:
----------------------------------------

Why do we need the norms?  Why can't we assume the user has already
normalized?

Improve RowSimilarityJob performance for count-based distance measures
----------------------------------------------------------------------

                 Key: MAHOUT-767
                 URL: https://issues.apache.org/jira/browse/MAHOUT-767
             Project: Mahout
          Issue Type: Improvement
            Reporter: Grant Ingersoll
             Fix For: 0.6


(See
http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7
for background)
Currently, the RowSimilarityJob defers the calculation of the similarity
metric until the reduce phase, while emitting many Cooccurrence objects.
  For similarity metrics that are algebraic (
http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should
be able to do much of the computation during the Mapper part of this phase
and also take advantage of a Combiner.
We should use a marker interface to know whether a similarity metric is
algebraic and then make use of an appropriate Mapper implementation,
otherwise we can fall back on our existing implementation.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
