Hi everyone,

  I finally read through all of the Map/Reduce paper.  I thought it was
an interesting read.

  Maybe not surprisingly, a lot of the M/R ideas are already in the
Nutch code, especially in the current version of the distributed WebDB.

  As with the traditional WebDB, a table of items (pages, URLs) in the
dist-WebDB consists of a sorted set of rows.  These items are sorted
according to an item-specific key.  Where the standard WebDB has a
single file that contains these rows, the dist-WebDB has a set of
files.  Each file contains the rows for a single region of the keyspace.
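
  A rough sketch of that layout, in Python rather than Nutch's Java.
This is only an illustration, not the actual dist-WebDB code: the names
`SPLITS`, `region_for`, and `build_regions` and the split keys are all
hypothetical, and in-memory lists stand in for the per-region files.

```python
from bisect import bisect_right

# Hypothetical split keys: region 0 holds keys < "g", region 1 holds
# keys in ["g", "n"), region 2 holds ["n", "t"), region 3 holds the rest.
SPLITS = ["g", "n", "t"]

def region_for(key, splits=SPLITS):
    """Return the index of the keyspace region (i.e. the file) that owns key."""
    return bisect_right(splits, key)

def build_regions(rows, splits=SPLITS):
    """Distribute (key, value) rows into per-region lists, each sorted by key."""
    regions = [[] for _ in range(len(splits) + 1)]
    for key, value in rows:
        regions[region_for(key, splits)].append((key, value))
    for region in regions:
        region.sort(key=lambda row: row[0])
    return regions
```

Reading the regions back in index order yields the same sorted row
sequence that the single-file WebDB would hold.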

  Each dist-WebDB edit is first written to one of several files,
chosen by the keyspace partition its key falls into.  The dist-WebDB
then allocates a processor for each of these partitions.  Each
processor sorts its partition of edits, then applies those edits to the
corresponding WebDB partition.  Applying the edits can in turn generate
edits to other tables in the dist-WebDB.  Those are allocated to
partitions just like the first set of edits.
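
  The edit flow above can be sketched roughly as follows, again in
Python and again purely as an illustration.  The edit format
`(key, op, value)`, the `"put"`/`"delete"` ops, and the function names
are assumptions, not the dist-WebDB's real interface.  A dict stands in
for what would really be a merge of two sorted files, and the follow-on
edits to other tables are not modeled.

```python
from bisect import bisect_right

def partition_edits(edits, splits):
    """Map-like step: route each (key, op, value) edit to its keyspace partition."""
    buckets = [[] for _ in range(len(splits) + 1)]
    for edit in edits:
        buckets[bisect_right(splits, edit[0])].append(edit)
    return buckets

def apply_edits(rows, edits):
    """Reduce-like step for one partition: sort the edits, then apply them.

    rows is a list of (key, value) pairs; edits are (key, op, value) with
    op in {"put", "delete"} (hypothetical ops).  Later edits to a key win.
    """
    table = dict(rows)
    # sorted() is stable, so edits to the same key keep their arrival order.
    for key, op, value in sorted(edits, key=lambda e: e[0]):
        if op == "put":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return sorted(table.items())

def run_round(partitions, edits, splits):
    """One round: partition the edits, then let each 'processor' apply its share."""
    buckets = partition_edits(edits, splits)
    return [apply_edits(rows, bucket) for rows, bucket in zip(partitions, buckets)]
```

Each call to `apply_edits` touches only its own partition, which is
what lets a separate processor handle each one.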

  So the dist-WebDB is very similar to a Map/Reduce program, except all
the M/R logic is built into the application itself.

  Of course, the M/R system from the paper has a lot of machinery built
in to handle failure, retries, and other distributed management
problems.  There's a little bit of that in the dist-WebDB, but not
enough.  (Also, the dist-WebDB is not clearly working yet, so that's
another issue ;)

  Interesting...
  --Mike




_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
