On Tue, Apr 26, 2011 at 3:35 AM, Benson Margulies <[email protected]> wrote:

> "I" have an LSH implementation that might get itself contributed. I'm
> in the middle of a desultory conversation with my colleagues about the
> question of whether there is a good reason to retain it as a
> closed-source item. I'm curious as to whether Mahout would be a
> suitable home.
>
> The implementation follows Petrovic. More to the point, I've worked
> very hard to minimize its memory footprint so as to allow it to sit
> there in memory indexing a very large collection of documents (of
> course, actually, feature vectors). The scheme is that all the data
> lives in live Java objects, and new items are also written to a log
> (made from Google Protocol Buffers; I now realize Avro might have been
> more to the point). There is a modularity that anticipates wanting to
> run on a scale where it couldn't fit into memory any more.
>

But it doesn't run as a Hadoop job? It's embarrassingly parallel, right,
and the hashes could be IntWritable or LongWritable, so it seems to fit
pretty naturally in Mahout in this way.

>
> Thus, its natural shape seems to me to be a web service, not a
> map-reduce thing. Are we interested as a project? Does this
> description make any sense?
>

Maybe your impl seems more of a web service, but I've naturally run it
more as a big batch operation: a single pass over the data, send everybody
to their hash buckets in the mapper, and then use the reducer just to sort
by bucket while tagging. Optionally build Bloom filters in the reducer
too, to compress your clusters.

  -jake
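
For concreteness, the batch flow described above (hash in the mapper, group by bucket in the reducer) could be sketched roughly as follows. This is plain Python rather than an actual Hadoop job, and the random-hyperplane hash family, the function names, and the parameters are all illustrative assumptions; none of it is taken from either implementation discussed in this thread. The Bloom filter step is left out.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes (an assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_bucket(vector, hyperplanes):
    """Pack the sign of each projection into one integer bucket id."""
    bucket = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bucket = (bucket << 1) | (1 if dot >= 0.0 else 0)
    return bucket


def map_phase(docs, hyperplanes):
    """Mapper stand-in: emit (bucket id, doc id) for every feature vector."""
    for doc_id, vector in docs.items():
        yield lsh_bucket(vector, hyperplanes), doc_id


def reduce_phase(pairs):
    """Reducer stand-in: group doc ids by bucket, ordered by bucket id."""
    buckets = defaultdict(list)
    for bucket, doc_id in pairs:
        buckets[bucket].append(doc_id)
    return {b: sorted(ids) for b, ids in sorted(buckets.items())}


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=8, dim=3, seed=42)
    docs = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0], "c": [-1.0, -2.0, -3.0]}
    for bucket, ids in reduce_phase(map_phase(docs, planes)).items():
        print(bucket, ids)
```

In a real job the integer bucket id would play the role of the IntWritable or LongWritable key, and the shuffle would do the grouping that `reduce_phase` fakes here with a dictionary.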
