On Tue, Apr 26, 2011 at 10:33 AM, Jake Mannix <[email protected]> wrote:
> But it doesn't run as a Hadoop job?  It's embarrassingly parallel, right,
> and the hashes could be IntWritable or LongWritable, seems to pretty
> naturally fit in Mahout in this way.

It depends on the use, I think.  The basic program should fit either way, but
having a very large in-memory structure for the search makes it fit map-reduce
a little less well.  Good use of mmap here might make it fit well.

> > Thus, its natural shape seems to me to be a web service, not a
> > map-reduce thing.  Are we interested as a project?  Does this
> > description make any sense?
>
> Maybe your impl seems more of a web service, but I've naturally run
> it more as a big batch operation: single pass over the data, send
> everybody to their hash buckets in the mapper, and then use the
> reducer just to sort by bucket while tagging.  Optionally build
> Bloom filters in the reducer too, to compress your clusters.

I think that this is a good fit as well.
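
For concreteness, here is a rough sketch of the batch shape Jake describes: the
mapper sends each record to its hash bucket (a LongWritable key), and the
reducer sees all members of a bucket together and emits them tagged with the
bucket id.  The class names, the bucket function, and the I/O formats below are
placeholders for illustration, not anything that exists in Mahout today.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class BucketJob {

      public static class BucketMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> {
        private final LongWritable bucket = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
          // Placeholder bucket function: a real LSH scheme would compute one
          // or more hash signatures from the record's features instead.
          bucket.set(record.toString().hashCode() & 0xffffL);
          context.write(bucket, record);
        }
      }

      public static class BucketReducer
          extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable bucket, Iterable<Text> members,
                              Context context)
            throws IOException, InterruptedException {
          // Everything sharing a bucket arrives together; emit each member
          // tagged with its bucket id.  A Bloom filter per bucket could be
          // built here instead, to compress the cluster representation.
          for (Text member : members) {
            context.write(bucket, member);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "lsh-bucketing");
        job.setJarByClass(BucketJob.class);
        job.setMapperClass(BucketMapper.class);
        job.setReducerClass(BucketReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The web-service shape would keep the buckets in memory (or mmapped) and probe
them per query instead of doing the single sorting pass, which is where the
two uses diverge.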
