On Jul 14, 2011, at 1:09 PM, Sebastian Schelter wrote:

> Hi Grant,
>
> There is only a little doc included in my talk about scaling collaborative
> filtering [1]. The algorithm is a modification of the idea of [2], so that
> might help too.
>
> RowSimilarityJob is also at the heart of our distributed recommendation code.
> From reading [3] I guess it's used by foursquare, and I know of other
> companies using it which I can't publicly talk about.
>
> Generally, row similarity is not only dependent on the size of the matrix, but
> also very much on its shape.
>
> If it's sparse, then RowSimilarityJob will only compare pairs of rows with a
> cooccurring value in a dimension; that is its advantage over a naive
> all-pairs comparison.
Yeah, it is sparse. Any thoughts on why not reuse our existing Distance measures? It seems like once you know that two vectors have something in common, there isn't much point in calculating all the co-occurrences; just save off those two (or whatever) and then later call the distance measure on the vectors.

> If it's dense or has some very dense rows/columns, then you will do an
> all-pairs comparison, which has quadratic growth, and you'd probably be better
> off using another approach.
>
> Could you give some numbers about the size of your input matrix and the value
> of the counter COOCCURRENCES from RowSimilarityJob?

Last I looked it was around 53B before it was killed.

> --sebastian
>
> [1] http://www.slideshare.net/sscdotopen/mahoutcf
> [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9712&rep=rep1&type=pdf
> [3] http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
>
> On 14.07.2011 18:57, Grant Ingersoll wrote:
>> Are there docs on RowSimilarity? Also, has anyone tried it at scale? I'm
>> seeing some long running times for a matrix that I don't think is huge
>> (still waiting to hear from a colleague about the actual size). What does the
>> distributed vector similarity get us over just using our existing distance
>> measures?
>>
>> Also, would there be interest in a job that is basically the map side of
>> K-Means and simply outputs the distance between some vector and a list of
>> vectors where the seed vectors fit in memory? It's similar to RowSimilarity,
>> but it doesn't bother with the co-occurrence calculation.
>>
>> -Grant

--------------------------
Grant Ingersoll
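To illustrate the point about sparsity and the COOCCURRENCES counter, here is a minimal single-machine sketch in Python. It is not Mahout's RowSimilarityJob; the function names and the dict-of-dicts matrix layout are assumptions made for illustration. It groups rows by column, emits only the row pairs that share a nonzero column, and shows why a single dense column leads to quadratic growth in the number of pairs.

```python
from collections import defaultdict
from itertools import combinations


def cooccurring_pairs(rows):
    """Count, per row pair, the columns in which both rows are nonzero.

    rows: dict mapping row id -> dict of {column: value}, nonzero entries only.
    (Hypothetical layout for this sketch, not Mahout's vector format.)
    """
    # Invert the matrix: column -> ids of rows with a nonzero entry there.
    by_column = defaultdict(list)
    for row_id, entries in rows.items():
        for col in entries:
            by_column[col].append(row_id)

    # Only rows that meet in some column are ever paired up.
    counts = defaultdict(int)
    for row_ids in by_column.values():
        for a, b in combinations(sorted(row_ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


def total_cooccurrences(rows):
    """Total number of (row pair, column) cooccurrences in the matrix."""
    return sum(cooccurring_pairs(rows).values())


sparse = {
    "r1": {0: 1.0, 1: 2.0},
    "r2": {1: 3.0},
    "r3": {2: 1.0},          # shares no column, so it is never compared
    "r4": {0: 5.0, 1: 1.0},
}
pairs = cooccurring_pairs(sparse)

# One column shared by n rows yields n * (n - 1) / 2 pairs.
dense_column = {f"r{i}": {0: 1.0} for i in range(100)}
dense_total = total_cooccurrences(dense_column)
```

In this sketch a column with n nonzero entries contributes n(n-1)/2 cooccurrences, so a few very dense columns can dominate the total even when the rest of the matrix is sparse.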

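The job proposed in the original message (the map side of K-Means, with seed vectors held in memory) could look roughly like the following Python sketch. This is an assumption-laden illustration, not an existing Mahout job: `map_distances`, `euclidean`, and the sparse dict representation are hypothetical names chosen here, and Euclidean distance stands in for whichever distance measure would actually be plugged in.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two sparse vectors given as {index: value}."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def map_distances(vector_id, vector, seeds, distance=euclidean):
    """Map step: emit the distance from one input vector to every seed.

    seeds: dict of seed id -> sparse vector, assumed small enough to fit
    in memory on every mapper. No cooccurrence computation is involved.
    """
    for seed_id, seed in seeds.items():
        yield (vector_id, seed_id), distance(vector, seed)


seeds = {"s1": {0: 0.0}, "s2": {0: 3.0, 1: 4.0}}
results = dict(map_distances("v1", {}, seeds))
```

Each input vector is processed independently, so the work grows with (number of vectors) x (number of seeds) rather than with the number of cooccurring row pairs.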