Are there docs on RowSimilarity? Also, has anyone tried it at scale? I'm seeing long running times for a matrix that I don't think is huge (I'm still waiting to hear from a colleague about the actual size). What does the distributed vector similarity get us over just using our existing distance measures?
Also, would there be interest in a job that is basically the map side of K-Means and simply outputs the distance between each input vector and a list of seed vectors, where the seed vectors fit in memory? It's similar to RowSimilarity, but it doesn't bother with the co-occurrence calculation.
-Grant
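P.S. To make the second idea concrete, here is a rough sketch of what I mean by "the map side of K-Means". It is plain Python rather than an actual Hadoop job, and every name in it (map_distances, euclidean, the seed and row ids) is made up for illustration, not an existing API. It assumes the seed vectors are already loaded in memory and that Euclidean distance stands in for whichever distance measure we would plug in.

```python
import math
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance; a stand-in for any pluggable distance measure."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_distances(
    rows: Iterable[Tuple[str, Vector]],
    seeds: Dict[str, Vector],
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> Iterator[Tuple[str, str, float]]:
    """Map-only pass: emit (row_id, seed_id, distance) for every row/seed pair.

    The seeds are held in memory, so each input row is handled independently
    and no co-occurrence step or reducer is needed.
    """
    for row_id, vec in rows:
        for seed_id, seed in seeds.items():
            yield row_id, seed_id, distance(vec, seed)


# Hypothetical toy data: two seed vectors and two input rows.
seeds = {"s0": [0.0, 0.0], "s1": [3.0, 4.0]}
rows = [("r0", [3.0, 4.0]), ("r1", [0.0, 0.0])]
results = list(map_distances(rows, seeds))
```

Since each row only needs the in-memory seeds, the work is proportional to rows times seeds and parallelizes trivially across mappers, which is the main difference from the all-pairs row comparison.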
