Awesome! Some thoughts inline.

On Aug 13, 2011, at 5:31 AM, Sebastian Schelter wrote:
> Hi there,
>
> I invested a lot of work in improving RowSimilarityJob recently and I'm writing to get some feedback on my intermediate results. I also dug out an old mailing list discussion about the matrix form of item-based recommendation that helped me a lot; I recommend reading it to anyone who wants to dive into the details here:
>
> http://lucene.472066.n3.nabble.com/Re-Taste-GenericItemBasedRecommender-td641062.html
>
> I managed to rework the algorithm as it was proposed in that mailing list discussion. The main goal was to drastically reduce the amount of data that needs to be shipped over the network in the cooccurrence computation phase. This is accomplished by making heavy use of a combiner for all supported similarity measures, by no longer attaching the "norms" of the rows to the cooccurrence pairs, and by emitting cooccurrence "stripes" instead of cooccurrence pairs.
>
> A similarity measure must now be expressed by four simple functions that are invoked in different stages of the algorithm:
>
> normalize(row)
> (invoked once at the beginning, can transform the input row vectors)
>
> norm(row)
> (invoked once at the beginning, maps an input vector to a double)
>
> aggregate(nonZeroValueInDimensionXFromRowA, nonZeroValueInDimensionXFromRowB)
> (invoked for each pair of cooccurring elements of two input vectors, maps them to a double; these doubles are then summed)
>
> similarity(summedAggregations, normOfRowA, normOfRowB, numColumns)
> (invoked at the end with the results of the other functions, computes the final similarity)
>
> A short list of how our commonly used measures can be expressed this way:
>
> cosine:
>
> normalize(...) normalizes the vectors to unit length
> aggregate(...) multiplies valueA and valueB
> similarity(...) just returns summedAggregations (which is the dot product)
>
> pearson:
>
> same as cosine, but normalize(...)
> centers the vector elements first (without touching 0s, as this would make the input matrix dense)
>
> cooccurrence-count / overlap:
>
> aggregate(...) returns 1, as we only need to count
> similarity(...) returns summedAggregations only
>
> tanimoto/jaccard:
>
> norm(...) returns the l0 norm (number of non-zero elements)
> aggregate(...) returns 1, as we only need to count
> similarity(...) returns summedAggregations / (normOfRowA + normOfRowB - summedAggregations)
>
> loglikelihood:
>
> same as tanimoto, but we compute LLR in similarity(...) from summedAggregations, normOfRowA, normOfRowB and numColumns
>
> cityblock/manhattan:
>
> same as tanimoto, but similarity(...) returns 1.0 / (1.0 + normA + normB - 2 * summedAggregations)
>
> euclidean distance:
>
> norm(...) returns the summed squares of the input vector
> aggregate(...) multiplies valueA and valueB
> similarity(...) returns 1.0 / (1.0 + Math.sqrt(normA - 2 * summedAggregations + normB))
>
> unfortunately we can't use the formula from o.a.m.cf.taste.impl.similarity.EuclideanDistanceSimilarity here, which also incorporates the number of "overlapping" dimensions.
>
> I'm currently running tests on a small cluster using several rating datasets (18M music ratings from lastfm, 100M movie ratings from netflix, 250M music ratings from yahoo's KDD cup). Results are pretty ok so far, but I'm still searching for ways to improve the algorithm.
>
> One thing I'm currently looking into is how to sample the input. Ted has stated that you usually only need to look at a few hundred or thousand ratings per item as you don't learn anything new from the rest. Would it be sufficient to randomly sample the ratings of an item then? That's what I'm currently doing, but I wonder whether there are more clever ways to do this.

I assume this will be an option, right? Row similarity obviously goes beyond just recommenders, and sampling shouldn't be required.
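For the curious, the four-function contract described above can be sketched roughly as follows. This is a hypothetical Java interface just to make the stages concrete (the names are illustrative, not the actual Mahout API), with cosine as the worked example:

```java
// Hypothetical sketch of the four-function similarity contract described above.
// The interface and class names are illustrative, not the real Mahout types.
interface VectorSimilarityMeasure {
  double[] normalize(double[] row);               // invoked once per input row
  double norm(double[] row);                      // invoked once per input row
  double aggregate(double valueA, double valueB); // per pair of cooccurring elements
  double similarity(double summedAggregations, double normA, double normB, int numColumns);
}

class CosineSimilarity implements VectorSimilarityMeasure {
  @Override
  public double[] normalize(double[] row) {
    // scale to unit length so the summed aggregations become the cosine directly
    double length = Math.sqrt(java.util.Arrays.stream(row).map(v -> v * v).sum());
    if (length == 0.0) {
      return row;
    }
    double[] normalized = new double[row.length];
    for (int i = 0; i < row.length; i++) {
      normalized[i] = row[i] / length;
    }
    return normalized;
  }

  @Override
  public double norm(double[] row) {
    return 1.0; // unused for cosine
  }

  @Override
  public double aggregate(double valueA, double valueB) {
    return valueA * valueB;
  }

  @Override
  public double similarity(double summedAggregations, double normA, double normB, int numColumns) {
    return summedAggregations; // dot product of two unit vectors
  }
}
```

Because aggregate(...) only multiplies element pairs and the final step just sums them, the per-pair products can be partially summed by a combiner before they reach the reducer, which is what makes the data-shipping reduction possible.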
> There is a sampling heuristic in the distributed recommender that caught my attention a long time ago, can anyone give any details on it? I basically understand what it's doing, but not exactly why that's a good way to sample. Here's the original implementation:
>
> http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/UserVectorToCooccurrenceMapper.java?revision=950049&view=co&pathrev=96188
>
> Another issue I'm looking into is whether it would help to specify a similarity threshold. There is some work on "All Pairs Similarity Search", which is closely related to cooccurrence computation, that seems to have some nice pruning strategies based on a threshold.

+1

> For example, imagine your similarity measure is cooccurrence count and you only want pairs with a count of at least 5; then you would not need to consider any pair where one of the vectors has fewer than 5 non-zero entries.
>
> Something like this could be a way to further improve the algorithm's running time, especially as the resulting similarity values seem to follow a heavy-tailed distribution. It should be possible to find pruning heuristics for all measures, yet I don't have an idea yet how to find a good a-priori threshold.

I could also see pruning at the very end on the final distance score, which would potentially save some IO.

> I'm thinking about writing a small paper about my work on that (still need to see if I can "sell" it to my professor :) ). I hope everyone who contributed to the discussion here is ok with that, I don't want to "steal" ideas.

Well, you can always be the primary and add others. :-)

> --sebastian
>
> On 20.07.2011 17:40, Ted Dunning wrote:
>> Not if you are doing Euclidean distance.
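The a-priori pruning idea for the cooccurrence-count example above can be sketched in a few lines: a row with fewer than `threshold` non-zero entries can never reach a cooccurrence count of `threshold` with any other row, so it can be dropped before any pairs are generated. The class and method names here are illustrative only:

```java
// Illustrative sketch of a-priori pruning for the cooccurrence-count measure:
// drop rows whose non-zero count cannot possibly reach the threshold.
import java.util.ArrayList;
import java.util.List;

class CountPruning {
  /** Keeps only rows whose number of non-zero entries reaches the threshold. */
  static List<double[]> prune(List<double[]> rows, int threshold) {
    List<double[]> kept = new ArrayList<>();
    for (double[] row : rows) {
      int nonZeros = 0;
      for (double v : row) {
        if (v != 0.0) {
          nonZeros++;
        }
      }
      // the cooccurrence count of this row with any other row is at most nonZeros
      if (nonZeros >= threshold) {
        kept.add(row);
      }
    }
    return kept;
  }
}
```

Finding a comparable a-priori bound for the other measures is exactly the open question raised above.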
>>
>> On Wed, Jul 20, 2011 at 5:41 AM, Grant Ingersoll (JIRA) <[email protected]> wrote:
>>
>>> [ https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068331#comment-13068331 ]
>>>
>>> Grant Ingersoll commented on MAHOUT-767:
>>> ----------------------------------------
>>>
>>> Why do we need the norms? Why can't we assume the user has already normalized?
>>>
>>>> Improve RowSimilarityJob performance for count-based distance measures
>>>> ----------------------------------------------------------------------
>>>>
>>>> Key: MAHOUT-767
>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-767
>>>> Project: Mahout
>>>> Issue Type: Improvement
>>>> Reporter: Grant Ingersoll
>>>> Fix For: 0.6
>>>>
>>>> (See http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7 for background)
>>>> Currently, the RowSimilarityJob defers the calculation of the similarity metric until the reduce phase, while emitting many Cooccurrence objects. For similarity metrics that are algebraic (http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should be able to do much of the computation during the Mapper part of this phase and also take advantage of a Combiner.
>>>> We should use a marker interface to know whether a similarity metric is algebraic and then make use of an appropriate Mapper implementation; otherwise we can fall back on our existing implementation.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira

--------------------------
Grant Ingersoll
