Awesome! Some thoughts inline.

On Aug 13, 2011, at 5:31 AM, Sebastian Schelter wrote:
> Hi there,
>
> I invested a lot of work in improving RowSimilarityJob recently and I'm writing to get some feedback on my intermediate results. I also dug out an old mailing list discussion about the matrix form of item-based recommendation that helped me a lot; I recommend reading it to anyone who wants to dive into the details here:
>
> http://lucene.472066.n3.nabble.com/Re-Taste-GenericItemBasedRecommender-td641062.html
>
> I managed to rework the algorithm as it was proposed in that mailing list discussion. The main goal was to drastically reduce the amount of data that needs to be shipped over the network in the cooccurrence computation phase. This is accomplished by making heavy use of a combiner for all supported similarity measures, by no longer attaching the "norms" of the rows to the cooccurrence pairs, and by emitting cooccurrence "stripes" instead of cooccurrence pairs.
>
> A similarity measure must now be expressed by four simple functions that are invoked in different stages of the algorithm:
>
> normalize(row)
> (invoked once at the beginning, can transform the input row vectors)
>
> norm(row)
> (invoked once at the beginning, maps an input vector to a double)
>
> aggregate(nonZeroValueInDimensionXFromRowA, nonZeroValueInDimensionXFromRowB)
> (invoked for each pair of cooccurring elements of two input vectors, maps them to a double; these doubles are then summed)
>
> similarity(summedAggregations, normOfRowA, normOfRowB, numColumns)
> (invoked at the end with the results of the other functions, computes the final similarity)
>
> A short list of how our commonly used measures can be expressed this way:
>
> cosine:
>
> normalize(...) normalizes the vectors to unit length
> aggregate(...) multiplies valueA and valueB
> similarity(...) just returns summedAggregations (which is the dot product)
>
> pearson:
>
> same as cosine, but normalize(...)
> centers the vector elements first (without touching 0s, as this would make the input matrix dense)
>
> cooccurrence-count / overlap:
>
> aggregate(...) returns 1, as we only need to count
> similarity(...) returns summedAggregations only
>
> tanimoto/jaccard:
>
> norm(...) returns the l0 norm (number of non-zero elements)
> aggregate(...) returns 1, as we only need to count
> similarity(...) returns summedAggregations / (normOfRowA + normOfRowB - summedAggregations)
>
> loglikelihood:
>
> same as tanimoto, but we compute LLR in similarity(...) from summedAggregations, normOfRowA, normOfRowB and numColumns
>
> cityblock/manhattan:
>
> same as tanimoto, but similarity(...) returns 1.0 / (1.0 + normA + normB - 2 * summedAggregations)
>
> euclidean distance:
>
> norm(...) returns the summed squares of the input vector
> aggregate(...) multiplies valueA and valueB
> similarity(...) returns 1.0 / (1.0 + Math.sqrt(normA - 2 * summedAggregations + normB))
>
> unfortunately we can't use the formula from o.a.m.cf.taste.impl.similarity.EuclideanDistanceSimilarity here, which also incorporates the number of "overlapping" dimensions.
>
> I'm currently running tests on a small cluster using several rating datasets (18M music ratings from lastfm, 100M movie ratings from netflix, 250M music ratings from yahoo's KDD cup). Results are pretty ok so far, but I'm still searching for ways to improve the algorithm.
>
> One thing I'm currently looking into is how to sample the input. Ted has stated that you usually only need to look at a few hundred or thousand ratings per item as you don't learn anything new from the rest. Would it be sufficient to randomly sample the ratings of an item then? That's what I'm currently doing, but I wonder whether there are more clever ways to do this.

I assume this will be an option, right? Row similarity obviously goes beyond just recommenders, and sampling shouldn't be required.
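For the curious, the four-function contract described above can be sketched roughly as follows. This is a hypothetical Java interface just to make the stages concrete (the names are illustrative, not the actual Mahout API), with cosine as the worked example:

```java
// Hypothetical sketch of the four-function similarity contract described above.
// The interface and class names are illustrative, not the real Mahout types.
interface VectorSimilarityMeasure {
  double[] normalize(double[] row);               // invoked once per input row
  double norm(double[] row);                      // invoked once per input row
  double aggregate(double valueA, double valueB); // per pair of cooccurring elements
  double similarity(double summedAggregations, double normA, double normB, int numColumns);
}

class CosineSimilarity implements VectorSimilarityMeasure {
  @Override
  public double[] normalize(double[] row) {
    // scale to unit length so the summed aggregations become the cosine directly
    double length = Math.sqrt(java.util.Arrays.stream(row).map(v -> v * v).sum());
    if (length == 0.0) {
      return row;
    }
    double[] normalized = new double[row.length];
    for (int i = 0; i < row.length; i++) {
      normalized[i] = row[i] / length;
    }
    return normalized;
  }

  @Override
  public double norm(double[] row) {
    return 1.0; // unused for cosine
  }

  @Override
  public double aggregate(double valueA, double valueB) {
    return valueA * valueB;
  }

  @Override
  public double similarity(double summedAggregations, double normA, double normB, int numColumns) {
    return summedAggregations; // dot product of two unit vectors
  }
}
```

Because aggregate(...) only multiplies element pairs and the final step just sums them, the per-pair products can be partially summed by a combiner before they reach the reducer, which is what makes the data-shipping reduction possible.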
> There is a sampling heuristic in the distributed recommender that caught my attention a long time ago, can anyone give any details on it? I basically understand what it's doing, but not exactly why that's a good way to sample. Here's the original implementation:
>
> http://svn.apache.org/viewvc/mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/UserVectorToCooccurrenceMapper.java?revision=950049&view=co&pathrev=96188
>
> Another issue I'm looking into is whether it would help to specify a similarity threshold. There is some work on "All Pairs Similarity Search", which is closely related to cooccurrence computation, that seems to have some nice pruning strategies based on a threshold.

+1

> For example, imagine your similarity measure is cooccurrence count and you only want pairs with a count of at least 5; then you would not need to consider any pair where one of the vectors has fewer than 5 non-zero entries.
>
> Something like this could be a way to further improve the algorithm's running time, especially as the resulting similarity values seem to follow a heavy-tailed distribution. It should be possible to find pruning heuristics for all measures, yet I don't have an idea yet how to find a good a-priori threshold.

I could also see pruning at the very end on the final distance score, which would potentially save some IO.

> I'm thinking about writing a small paper about my work on that (still need to see if I can "sell" it to my professor :) ). I hope everyone who contributed to the discussion here is ok with that, I don't want to "steal" ideas.

Well, you can always be the primary and add others. :-)

> --sebastian
>
> On 20.07.2011 17:40, Ted Dunning wrote:
>> Not if you are doing Euclidean distance.
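The a-priori pruning idea for the cooccurrence-count example above can be sketched in a few lines: a row with fewer than `threshold` non-zero entries can never reach a cooccurrence count of `threshold` with any other row, so it can be dropped before any pairs are generated. The class and method names here are illustrative only:

```java
// Illustrative sketch of a-priori pruning for the cooccurrence-count measure:
// drop rows whose non-zero count cannot possibly reach the threshold.
import java.util.ArrayList;
import java.util.List;

class CountPruning {
  /** Keeps only rows whose number of non-zero entries reaches the threshold. */
  static List<double[]> prune(List<double[]> rows, int threshold) {
    List<double[]> kept = new ArrayList<>();
    for (double[] row : rows) {
      int nonZeros = 0;
      for (double v : row) {
        if (v != 0.0) {
          nonZeros++;
        }
      }
      // the cooccurrence count of this row with any other row is at most nonZeros
      if (nonZeros >= threshold) {
        kept.add(row);
      }
    }
    return kept;
  }
}
```

Finding a comparable a-priori bound for the other measures is exactly the open question raised above.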
>>
>> On Wed, Jul 20, 2011 at 5:41 AM, Grant Ingersoll (JIRA) <[email protected]> wrote:
>>
>>> [ https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068331#comment-13068331 ]
>>>
>>> Grant Ingersoll commented on MAHOUT-767:
>>> ----------------------------------------
>>>
>>> Why do we need the norms? Why can't we assume the user has already normalized?
>>>
>>>> Improve RowSimilarityJob performance for count-based distance measures
>>>> ----------------------------------------------------------------------
>>>>
>>>> Key: MAHOUT-767
>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-767
>>>> Project: Mahout
>>>> Issue Type: Improvement
>>>> Reporter: Grant Ingersoll
>>>> Fix For: 0.6
>>>>
>>>> (See http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7 for background)
>>>> Currently, the RowSimilarityJob defers the calculation of the similarity metric until the reduce phase, while emitting many Cooccurrence objects. For similarity metrics that are algebraic (http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should be able to do much of the computation during the Mapper part of this phase and also take advantage of a Combiner.
>>>> We should use a marker interface to know whether a similarity metric is algebraic and then make use of an appropriate Mapper implementation; otherwise we can fall back on our existing implementation.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira

--------------------------
Grant Ingersoll
