On Tue, Apr 26, 2011 at 1:24 PM, Benson Margulies <[email protected]> wrote:
> ... If the documents arrive serially, map-reduce is uninteresting to scale
> across documents.

Absolutely.

> However, there are the multiple hash tables.
>
> Now, with parameters from Petrovic, for a large (1M doc) store, you
> have ~72 tables. Yes, you could put 72 tables on 72 nodes: map sends
> things to them, reduce collates the results.

S4 is better for real-time use like this.

> I've never seen a hadoop 'thing' that has permanent in-memory state
> like this. I'm not sure where memory mapping comes into the picture.

Hadoop would help if you had 100,000 documents arrive at once. At that
point, memory mapping would help because you could run multiple threads
in the mapper against the same large in-memory data structures. Without
that, you either duplicate the memory structures or you can't run many
threads. Neither is good.

Also, if you have a high-throughput web service, you can map-reduce
against that. Each mapper would read a document and throw it against the
web service. Since the LSH server is memory based anyway, that works
pretty well. What you gain here is smooth multi-node reading of the
input documents and parallelization of some of the parsing tasks.

> If folks are game, I'll poke the question of contribution some more.

I am interested.
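
To make the mapper-against-a-web-service idea above concrete, here is a
rough sketch of what such a mapper might look like. The class name, the
"lsh.service.url" configuration key, and the request/response protocol
(POST the document text, get back a line of neighbor ids) are all invented
for illustration; only the Hadoop Mapper API itself is real.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each map() call throws one document at the memory-resident LSH
    // service and emits whatever near-duplicate ids it answers with.
    public class LshLookupMapper extends Mapper<Text, Text, Text, Text> {

      private URL serviceUrl;

      @Override
      protected void setup(Context context) throws IOException {
        // "lsh.service.url" is an invented configuration key.
        serviceUrl = new URL(context.getConfiguration().get("lsh.service.url"));
      }

      @Override
      protected void map(Text docId, Text docText, Context context)
          throws IOException, InterruptedException {
        HttpURLConnection conn = (HttpURLConnection) serviceUrl.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);

        // Send the raw document text to the LSH service.
        OutputStream out = conn.getOutputStream();
        out.write(docText.getBytes(), 0, docText.getLength());
        out.close();

        // Assume the service answers with one line of neighbor doc ids.
        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String neighbors = in.readLine();
        in.close();

        if (neighbors != null) {
          context.write(docId, new Text(neighbors));
        }
      }
    }

Run with something like KeyValueTextInputFormat feeding (docId, docText)
pairs and zero reducers, this gives the multi-node input reading and
parsing parallelism described above while the LSH state stays in the
service's memory.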
