Hi everyone, I finally read through all of the Map/Reduce paper. I thought it was an interesting read.
Maybe not surprisingly, a lot of the M/R ideas are already in the Nutch code, especially in the current version of the distributed WebDB.

As with the traditional WebDB, a table of items (pages, URLs) in the dist-WebDB consists of a sorted set of rows, ordered by an item-specific key. Where the standard WebDB has a single file that contains these rows, the dist-WebDB has a set of files. Each file contains the rows for a single region of the keyspace.

Each dist-WebDB edit is first written to one of several files, chosen by the keyspace partition its key falls in. The dist-WebDB then allocates a processor for each of these partitions. The processor sorts its partition of edits, then applies those edits to the corresponding WebDB partition. Applying one batch of edits might generate edits to other tables in the dist-WebDB. Those are allocated to partitions just like the first set of edits.

So the dist-WebDB is very similar to a Map/Reduce program, except that all the M/R logic is built into the application itself.

Of course, the M/R system from the paper has a lot of machinery built in to handle failure, retries, and other distributed management problems. There's a little bit of that in the dist-WebDB, but not enough. (Also, the dist-WebDB does not clearly work yet, so that's another issue ;)

Interesting...

--Mike
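
P.S. To make the comparison concrete, here is a rough Python sketch of the partition / sort / apply cycle described above. It is not the Nutch code. The split points, the table names, and the cascade rule (a page edit producing link edits) are all invented for illustration, and the per-partition "processors" run one after another here instead of in parallel.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical split points that divide the keyspace into three regions.
SPLIT_POINTS = ["h", "p"]


def partition_of(key):
    """Return the index of the keyspace region that holds `key`."""
    return bisect_right(SPLIT_POINTS, key)


def apply_round(tables, edits):
    """Apply one round of edits and return the follow-on edits it generates.

    tables: {table_name: [one dict per keyspace partition]}
    edits:  list of (table_name, key, value) tuples
    """
    # "Map" side: write each edit to the bucket for its table and partition.
    buckets = defaultdict(list)
    for table, key, value in edits:
        buckets[(table, partition_of(key))].append((key, value))

    follow_on = []
    # "Reduce" side: one processor per bucket (sequential in this sketch).
    for (table, part), bucket in sorted(buckets.items()):
        # Sort the partition's edits by key, then apply them in order.
        for key, value in sorted(bucket, key=lambda kv: kv[0]):
            tables[table][part][key] = value
            # Made-up cascade rule: a page edit emits one link edit
            # per outgoing URL, destined for the "links" table.
            if table == "pages":
                for url in value:
                    follow_on.append(("links", url, key))
    return follow_on


def run(tables, edits):
    """Keep applying rounds until no follow-on edits remain."""
    rounds = 0
    while edits:
        edits = apply_round(tables, edits)
        rounds += 1
    return rounds


if __name__ == "__main__":
    tables = {"pages": [{}, {}, {}], "links": [{}, {}, {}]}
    edits = [
        ("pages", "zebra.org", ["apple.com"]),
        ("pages", "apple.com", ["mango.net"]),
    ]
    print(run(tables, edits), tables)
```

The bucketing step plays the role of the map phase's partitioning, and the sorted per-bucket loop plays the role of a reduce task. The difference from the paper's system is the one noted above: nothing in this loop handles a failed or slow processor.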
