On Tue, Apr 26, 2011 at 1:24 PM, Benson Margulies <[email protected]> wrote:
> ... If the documents arrive serially, map-reduce is uninteresting to scale
> across documents.

Absolutely.

> However, there are the multiple hash tables.
>
> Now, with parameters from Petrovic, for a large (1M doc) store, you
> have ~72 tables. Yes, you could put 72 tables on 72 nodes: map sends
> things to them, reduce collates the results.

S4 is better for real-time use like this.

> I've never seen a hadoop 'thing' that has permanent in-memory state
> like this. I'm not sure where memory mapping comes into the picture.

Hadoop would help if you had 100,000 documents arrive at once. At that
point, memory mapping would help because you could run multiple threads
in the mapper against the same large in-memory data structures. Without
that, you either duplicate the memory structures or you can't run many
threads. Neither is good.

Also, if you have a high-throughput web service, you can map-reduce
against that. Each mapper would read a document and throw it against the
web service. Since the LSH server is memory based anyway, that works
pretty well. What you gain here is smooth multi-node reading of the
input documents and parallelization of some of the parsing tasks.

> If folks are game, I'll poke the question of contribution some more.

I am interested.
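
To make the mapper-against-a-web-service idea above concrete, here is a
rough sketch of what such a mapper might look like. The class name, the
"lsh.service.url" configuration key, and the request/response protocol
(POST the document text, get back a line of neighbor ids) are all invented
for illustration; only the Hadoop Mapper API itself is real.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Each map() call throws one document at the memory-resident LSH
    // service and emits whatever near-duplicate ids it answers with.
    public class LshLookupMapper extends Mapper<Text, Text, Text, Text> {

      private URL serviceUrl;

      @Override
      protected void setup(Context context) throws IOException {
        // "lsh.service.url" is an invented configuration key.
        serviceUrl = new URL(context.getConfiguration().get("lsh.service.url"));
      }

      @Override
      protected void map(Text docId, Text docText, Context context)
          throws IOException, InterruptedException {
        HttpURLConnection conn = (HttpURLConnection) serviceUrl.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);

        // Send the raw document text to the LSH service.
        OutputStream out = conn.getOutputStream();
        out.write(docText.getBytes(), 0, docText.getLength());
        out.close();

        // Assume the service answers with one line of neighbor doc ids.
        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String neighbors = in.readLine();
        in.close();

        if (neighbors != null) {
          context.write(docId, new Text(neighbors));
        }
      }
    }

Run with something like KeyValueTextInputFormat feeding (docId, docText)
pairs and zero reducers, this gives the multi-node input reading and
parsing parallelism described above while the LSH state stays in the
service's memory.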
