Hi there,
I've invested a lot of work in improving RowSimilarityJob recently and
I'm writing to get some feedback on my intermediate results. I also dug
out an old mailing list discussion about the matrix form of item-based
recommendation that helped me a lot; I recommend it to anyone who wants
to dive into the details here:
http://lucene.472066.n3.nabble.com/Re-Taste-GenericItemBasedRecommender-td641062.html
I have reworked the algorithm along the lines proposed in that mailing
list discussion. The main goal was to drastically reduce the amount of
data that needs to be shipped over the network in the cooccurrence
computation phase. This is accomplished by making heavy use of a
combiner for all supported similarity measures, by no longer attaching
the "norms" of the rows to the cooccurrence pairs, and by emitting
cooccurrence "stripes" instead of individual cooccurrence pairs.
A similarity measure must now be expressed by four simple functions
that are invoked in different stages of the algorithm:

normalize(row) (invoked once at the beginning, can transform the input
row vectors)
norm(row) (invoked once at the beginning, maps an input vector to a
double)
aggregate(nonZeroValueInDimensionXFromRowA, nonZeroValueInDimensionXFromRowB)
(invoked for each pair of cooccurring elements of two input vectors,
maps them to a double; these doubles are then summed)
similarity(summedAggregations, normOfRowA, normOfRowB, numColumns)
(invoked at the end with the results of the other functions, computes
the final similarity)
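Expressed in code, the contract could look roughly like the following
interface (a sketch with names of my own choosing, not necessarily the
exact signatures in the patch):

import org.apache.mahout.math.Vector;

interface VectorSimilarityMeasure {

  /* invoked once per input row, may transform it */
  Vector normalize(Vector row);

  /* invoked once per input row, condenses it to a single double */
  double norm(Vector row);

  /* invoked per pair of cooccurring non-zero elements, results are summed */
  double aggregate(double nonZeroValueA, double nonZeroValueB);

  /* invoked once per row pair with the results of the other functions */
  double similarity(double summedAggregations, double normOfRowA,
      double normOfRowB, int numColumns);
}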
A short list of how our commonly used measures can be expressed this
way (a code sketch for two of them follows after the list):
cosine:
normalize(...) normalizes the vectors to unit length
aggregate(...) multiplies valueA and valueB
similarity(...) just needs to return summedAggregations (which is the
dot product)
pearson:
same as cosine, but normalize(...) centers the vector elements first
(without touching 0s as this would make the input matrix dense)
cooccurrence-count / overlap:
aggregate(...) returns 1 as we only need to count
similarity(...) returns summedAggregations only
tanimoto/jaccard:
norm(...) returns the L0 norm (number of non-zero elements)
aggregate(...) returns 1 as we only need to count
similarity(...) returns summedAggregations / (normOfRowA + normOfRowB -
summedAggregations)
loglikelihood:
same as tanimoto, but we compute the LLR in similarity(...) from
summedAggregations, normOfRowA, normOfRowB and numColumns
cityblock/manhattan:
same as tanimoto, but similarity(...) returns 1.0 / (1.0 + normOfRowA +
normOfRowB - 2 * summedAggregations)
euclidean distance:
norm(...) returns the summed squares of the input vector
aggregate(...) multiplies valueA and valueB
similarity(...) returns 1.0 / (1.0 + Math.sqrt(normOfRowA - 2 *
summedAggregations + normOfRowB))
Unfortunately, we can't use the formula from
o.a.m.cf.taste.impl.similarity.EuclideanDistanceSimilarity here, as it
also incorporates the number of "overlapping" dimensions.
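To make this concrete, here is how cosine and tanimoto could look
against the sketched interface (again my hypothetical names, not the
exact code from the patch):

import org.apache.mahout.math.Vector;

/* Cosine: with unit-length rows, the summed products are the dot product. */
class CosineSimilarity implements VectorSimilarityMeasure {
  public Vector normalize(Vector row) {
    return row.normalize();    // scale to unit length
  }
  public double norm(Vector row) {
    return 1.0;                // unused, rows already have unit length
  }
  public double aggregate(double nonZeroValueA, double nonZeroValueB) {
    return nonZeroValueA * nonZeroValueB;
  }
  public double similarity(double summedAggregations, double normOfRowA,
      double normOfRowB, int numColumns) {
    return summedAggregations; // the dot product of the normalized rows
  }
}

/* Tanimoto/Jaccard: count cooccurrences, norms count non-zero elements. */
class TanimotoSimilarity implements VectorSimilarityMeasure {
  public Vector normalize(Vector row) {
    return row;
  }
  public double norm(Vector row) {
    return row.getNumNondefaultElements(); // the L0 norm
  }
  public double aggregate(double nonZeroValueA, double nonZeroValueB) {
    return 1.0;                            // we only need to count
  }
  public double similarity(double summedAggregations, double normOfRowA,
      double normOfRowB, int numColumns) {
    return summedAggregations / (normOfRowA + normOfRowB - summedAggregations);
  }
}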
I'm currently running tests on a small cluster using several rating
datasets (18M music ratings from Last.fm, 100M movie ratings from
Netflix, 250M music ratings from Yahoo!'s KDD Cup). The results look
reasonable so far, but I'm still searching for ways to improve the
algorithm.
One thing I'm currently looking into is how to sample the input. Ted
has stated that you usually only need to look at a few hundred or
thousand ratings per item, as you don't learn anything new from the
rest. Would it be sufficient to randomly sample the ratings of an item
then? That's what I'm currently doing (along the lines of the sketch
below), but I wonder whether there are more clever ways to do this.
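For reference, the sampling I'm currently doing looks roughly like this
(a sketch with a hypothetical helper name; maxEntries would be the few
hundred or thousand ratings Ted mentioned):

import java.util.Iterator;
import java.util.Random;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

/* Caps a row at maxEntries non-zero elements via reservoir sampling,
 * so that every rating has the same chance of surviving. */
class RatingSampler {

  static Vector sample(Vector row, int maxEntries, Random random) {
    if (row.getNumNondefaultElements() <= maxEntries) {
      return row;
    }
    // copy indexes and values, as the iterator may reuse its element object
    int[] indexes = new int[maxEntries];
    double[] values = new double[maxEntries];
    int seen = 0;
    Iterator<Vector.Element> elements = row.iterateNonZero();
    while (elements.hasNext()) {
      Vector.Element element = elements.next();
      if (seen < maxEntries) {
        indexes[seen] = element.index();
        values[seen] = element.get();
      } else {
        int position = random.nextInt(seen + 1);
        if (position < maxEntries) {
          indexes[position] = element.index();
          values[position] = element.get();
        }
      }
      seen++;
    }
    Vector sampled = new RandomAccessSparseVector(row.size(), maxEntries);
    for (int i = 0; i < maxEntries; i++) {
      sampled.setQuick(indexes[i], values[i]);
    }
    return sampled;
  }
}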
There is a sampling heuristic in the distributed recommender that
caught my attention a long time ago; can anyone give me some details on
it? I basically understand what it does, but not exactly why it's a
good way to sample. Here's the original implementation:
http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/UserVectorToCooccurrenceMapper.java?revision=950049&view=co&pathrev=96188
Another issue I'm looking into is whether it would help to specify a
similarity threshold. There is some work on "All Pairs Similarity
Search", which is closely related to cooccurrence computation and seems
to have some nice pruning strategies based on a threshold.
For example, imagine your similarity measure is the cooccurrence count
and you only want to keep pairs with a count of at least 5; then you
would not need to consider any pair where one of the vectors has fewer
than 5 non-zero entries.
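For the cooccurrence count case, that pruning rule boils down to a
one-liner (a sketch, with threshold being the minimum count we care
about):

import org.apache.mahout.math.Vector;

/* A row with fewer than threshold non-zero entries cannot take part in
 * any pair whose cooccurrence count reaches the threshold, so it can be
 * skipped before the cooccurrence computation. */
class CooccurrencePruning {
  static boolean canBeSkipped(Vector row, int threshold) {
    return row.getNumNondefaultElements() < threshold;
  }
}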
Something like this could be a way to further improve the algorithm's
running time, especially as the resulting similarity values seem to
follow a heavy-tailed distribution. It should be possible to find
pruning heuristics for all measures, but I don't yet have an idea of
how to choose a good a priori threshold.
I'm thinking about writing a small paper about my work on this (I still
need to see whether I can "sell" it to my professor :) ). I hope
everyone who contributed to the discussion here is ok with that, as I
don't want to "steal" ideas.
--sebastian
On 20.07.2011 17:40, Ted Dunning wrote:
Not if you are doing Euclidean distance.
On Wed, Jul 20, 2011 at 5:41 AM, Grant Ingersoll (JIRA) <[email protected]> wrote:
[
https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068331#comment-13068331]
Grant Ingersoll commented on MAHOUT-767:
----------------------------------------
Why do we need the norms? Why can't we assume the user has already
normalized?
Improve RowSimilarityJob performance for count-based distance measures
----------------------------------------------------------------------
Key: MAHOUT-767
URL: https://issues.apache.org/jira/browse/MAHOUT-767
Project: Mahout
Issue Type: Improvement
Reporter: Grant Ingersoll
Fix For: 0.6
(See
http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7
for background)
Currently, the RowSimilarityJob defers the calculation of the similarity
metric until the reduce phase, while emitting many Cooccurrence objects.
For similarity metrics that are algebraic (
http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should
be able to do much of the computation during the Mapper part of this phase
and also take advantage of a Combiner.
We should use a marker interface to know whether a similarity metric is
algebraic and then make use of an appropriate Mapper implementation,
otherwise we can fall back on our existing implementation.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira