"I" have an LSH implementation that might get itself contributed. I'm in the middle of a desultory conversation with my colleagues about the question of whether there is a good reason to retain it as a closed-source item. I'm curious as to whether Mahout would be a suitable home.
The implementation follows Petrovic. More to the point, I've worked very hard to minimize its memory footprint so that it can sit in memory indexing a very large collection of documents (or, strictly speaking, feature vectors). The scheme is that all the data lives in live Java objects, and new items are also written to a log (made from Google Protocol Buffers; I now realize Avro might have been a better fit). There is a modularity that anticipates wanting to run at a scale where the index no longer fits into memory. Thus, its natural shape seems to me to be a web service, not a map-reduce thing. Are we interested as a project? Does this description make any sense?
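To make the "live objects plus a log" arrangement concrete, here is a rough sketch of the shape I mean. It is not the actual code. It assumes plain random-hyperplane hashing with a single hash table, which may differ from the Petrovic scheme in its details, and it writes the log with `DataOutputStream` as a stand-in for the protocol buffer records. The class name `LshSketch` and all method names are made up for illustration.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: an in-memory LSH index whose inserts are also appended to a log.
class LshSketch {
    private final double[][] planes;                       // one random hyperplane per signature bit
    private final Map<Integer, List<Integer>> buckets = new HashMap<>();
    private final List<double[]> vectors = new ArrayList<>();
    private final DataOutputStream log;                    // stand-in for a protobuf/Avro log

    // bits must be at most 32 because the signature is packed into an int.
    LshSketch(int dim, int bits, long seed, OutputStream logSink) {
        Random rnd = new Random(seed);
        planes = new double[bits][dim];
        for (double[] p : planes) {
            for (int i = 0; i < dim; i++) p[i] = rnd.nextGaussian();
        }
        log = new DataOutputStream(logSink);
    }

    // Bit b is set when the vector lies on the non-negative side of hyperplane b.
    int signature(double[] v) {
        int sig = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0;
            for (int i = 0; i < v.length; i++) dot += planes[b][i] * v[i];
            if (dot >= 0) sig |= 1 << b;
        }
        return sig;
    }

    // Append the vector to the log first, then index it in memory. Returns its id.
    int add(double[] v) throws IOException {
        log.writeInt(v.length);
        for (double x : v) log.writeDouble(x);
        log.flush();
        int id = vectors.size();
        vectors.add(v.clone());
        buckets.computeIfAbsent(signature(v), k -> new ArrayList<>()).add(id);
        return id;
    }

    // Ids of stored vectors that share the query's bucket (candidate near neighbours).
    List<Integer> candidates(double[] q) {
        return buckets.getOrDefault(signature(q), Collections.emptyList());
    }

    int size() {
        return vectors.size();
    }

    // Rebuild an index by re-reading a log; the same seed reproduces the same hyperplanes.
    static LshSketch replay(int dim, int bits, long seed, InputStream in, OutputStream newLog)
            throws IOException {
        LshSketch idx = new LshSketch(dim, bits, seed, newLog);
        DataInputStream data = new DataInputStream(in);
        while (true) {
            int n;
            try {
                n = data.readInt();
            } catch (EOFException e) {
                break;                                     // clean end of log
            }
            double[] v = new double[n];
            for (int i = 0; i < n; i++) v[i] = data.readDouble();
            idx.add(v);
        }
        return idx;
    }
}
```

A web service would wrap `add` and `candidates` behind request handlers. A real index would use several tables and compare the candidates by actual cosine similarity, both of which are left out here.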
