Hi Grant,
There is only a little documentation, which is included in my talk about
scaling collaborative filtering [1]. The algorithm is a modification of the
idea in [2], so that might help too.
RowSimilarityJob is also at the heart of our distributed recommendation
code. From reading [3] I guess it's used by foursquare, and I know of
other companies using it which I can't publicly talk about.
Generally the cost of computing row similarities depends not only on the
size of the matrix, but also very much on its shape.
If it's sparse, then RowSimilarityJob will only compare pairs of rows
with a cooccurring value in a dimension; that is its advantage over a
naive all-pairs comparison.
If it's dense or has some very dense rows/columns, then you will
effectively do an all-pairs comparison, which has quadratic growth, and
you'd probably be better off using another approach.
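To make the sparse vs. dense point concrete, here is a minimal Python sketch
of the idea. It is not the Mahout implementation; the function name
cooccurring_pairs and the dict-of-dicts matrix layout are made up for
illustration. It inverts the matrix by column and only pairs up rows that
share a non-zero column, counting how many such pairings it sees.

```python
from collections import defaultdict
from itertools import combinations


def cooccurring_pairs(rows):
    """Return (distinct row pairs to compare, number of cooccurrences).

    rows: dict mapping row id -> {column: value}, non-zero entries only.
    Hypothetical helper for illustration, not part of Mahout.
    """
    # Invert the matrix: for each column, list the rows with a value in it.
    by_column = defaultdict(list)
    for row_id, entries in rows.items():
        for col in entries:
            by_column[col].append(row_id)

    pairs = set()
    cooccurrences = 0
    # Only rows that share a column are ever paired up.
    for row_ids in by_column.values():
        for a, b in combinations(sorted(row_ids), 2):
            pairs.add((a, b))
            cooccurrences += 1
    return pairs, cooccurrences


# Sparse case: 4 rows, only 2 of the 6 possible pairs share a column.
sparse = {0: {0: 1.0}, 1: {0: 2.0, 1: 1.0}, 2: {1: 3.0}, 3: {2: 1.0}}
print(cooccurring_pairs(sparse))

# One fully dense column: every pair of rows gets compared.
dense = {i: {0: 1.0} for i in range(4)}
print(cooccurring_pairs(dense))
```

A single column with n non-zero entries contributes n * (n - 1) / 2 pairings
here, which is why a few very dense columns are enough to push the work back
towards the all-pairs case.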
Could you give some numbers about the size of your input matrix and the
value of the counter COOCCURRENCES from RowSimilarityJob?
--sebastian
[1] http://www.slideshare.net/sscdotopen/mahoutcf
[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.144.9712&rep=rep1&type=pdf
[3] http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/
On 14.07.2011 18:57, Grant Ingersoll wrote:
Are there docs on RowSimilarity? Also, has anyone tried it at scale? I'm
seeing some long running times for a matrix that I don't think is huge
(still waiting to hear from a colleague about the actual size). What does
the distributed vector similarity get us over just using our existing
distance measures?
Also, would there be interest in a job that is basically the map side of
K-Means and simply outputs the distance between some vector and a list of
vectors where the seed vectors fit in memory? It's similar to RowSimilarity,
but it doesn't bother with the co-occurrence calculation.
-Grant