Re: RowSimilarity ?'s

Sebastian Schelter Thu, 14 Jul 2011 10:09:59 -0700

Hi Grant,

There is only a little documentation, included in my talk about scaling collaborative filtering [1]. The algorithm is a modification of the idea of [2], so that might help too.

RowSimilarityJob is also at the heart of our distributed recommendation code. From reading [3] I guess it's used by foursquare, and I know of other companies using it which I can't publicly talk about.

Generally, the cost of computing row similarity depends not only on the size of the matrix, but also very much on its shape.

If it's sparse, then RowSimilarityJob will only compare pairs of rows with a cooccurring value in a dimension; that is its advantage over a naive all-pairs comparison.

If it's dense or has some very dense rows/columns, then you will effectively do an all-pairs comparison, which has quadratic growth, and you'd probably be better off using another approach.
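
To make the sparse vs. dense difference concrete, here is a rough sketch of the idea. It is only an illustration and not Mahout's actual code: the function name `cooccurring_pairs` and the toy matrices are made up for this example, and the counter merely mimics what a cooccurrence count measures.

```python
from collections import defaultdict
from itertools import combinations


def cooccurring_pairs(rows):
    """rows maps row id -> {column: value} (a sparse row).

    Returns the set of row pairs that share at least one column,
    plus the total number of cooccurrences emitted (one per shared column).
    Hypothetical helper for illustration only.
    """
    # Invert the matrix: for each column, collect the rows with a non-zero entry.
    by_column = defaultdict(list)
    for row_id, entries in rows.items():
        for col, value in entries.items():
            if value != 0:
                by_column[col].append(row_id)

    # Every pair of rows inside one column is one cooccurrence.
    pairs = set()
    cooccurrences = 0
    for row_ids in by_column.values():
        for a, b in combinations(sorted(row_ids), 2):
            pairs.add((a, b))
            cooccurrences += 1
    return pairs, cooccurrences


# Sparse toy matrix: 4 rows, few shared columns.
sparse = {0: {0: 1.0}, 1: {0: 2.0, 1: 1.0}, 2: {1: 3.0}, 3: {2: 1.0}}
# Dense toy matrix: 4 rows, every row has a value in every column.
dense = {r: {c: 1.0 for c in range(3)} for r in range(4)}

sparse_pairs, sparse_count = cooccurring_pairs(sparse)
dense_pairs, dense_count = cooccurring_pairs(dense)
all_pairs = len(sparse) * (len(sparse) - 1) // 2

print(len(sparse_pairs), sparse_count, all_pairs)
print(len(dense_pairs), dense_count, all_pairs)
```

With 4 rows there are 6 possible pairs. The sparse matrix only touches 2 of them, while the dense one touches all 6 and emits a cooccurrence for each of them once per column, so the work grows with the number of rows squared per dense column.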

Could you give some numbers about the size of your input matrix and the value of the counter COOCCURRENCES from RowSimilarityJob?


--sebastian


[1] http://www.slideshare.net/sscdotopen/mahoutcf

[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9712&rep=rep1&type=pdf

[3] http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/





On 14.07.2011 18:57, Grant Ingersoll wrote:

Are there docs on RowSimilarity? Also, has anyone tried it at scale? I'm seeing some long running times for a matrix that I don't think is huge (still waiting to hear from a colleague about the actual size). What does the distributed vector similarity get us over just using our existing distance measures?


Also, would there be interest in a job that is basically the map side of 
K-Means and simply outputs the distance between some vector and a list of 
vectors where the seed vectors fit in memory? It's similar to RowSimilarity, 
but it doesn't bother with the co-occurrence calculation.
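
A minimal sketch of what such a map step could look like, assuming sparse vectors stored as dicts and Euclidean distance. The names `euclidean` and `map_distances` are hypothetical and only illustrate the idea; this is not an existing Mahout job.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two sparse vectors given as {index: value}."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def map_distances(vector_id, vector, seeds):
    """Map step: emit the distance from one input vector to every seed.

    seeds is assumed to be small enough to be held in memory by each mapper.
    """
    for seed_id, seed in seeds.items():
        yield (vector_id, seed_id, euclidean(vector, seed))


# Toy data: two in-memory seed vectors and one input vector.
seeds = {"s0": {0: 0.0, 1: 0.0}, "s1": {0: 3.0, 1: 4.0}}
result = list(map_distances("v", {0: 3.0, 1: 0.0}, seeds))
print(result)
```

Each input vector is compared against the seed list only, so the work is linear in the number of input vectors and no pairs of input rows are ever formed.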


-Grant
