Hi all,

I've been reading this list for some time and Nutch is really a
project with great potential. I've played with the MapReduce
implementation in Nutch for a while and written some toy code. I'm
very interested in helping rewrite some of the distributed
algorithms in MapReduce.

Actually I have already partially implemented a MapReduce WebDB
writer. I don't know whether anyone else is working on this now. But
that's ok anyway. It uses basically the same algorithm as Mike
Cafarella's DistributedWebDBWriter. Here are some issues I'd like to
discuss. As I'm still new to Nutch, I could be wrong.

The overall approach is based on the assumption that tools using the
new MapReduce-based WebDB writer will be using MapReduce themselves
too. So there will be a single "master" node controlling the whole
process. This, I think, makes it necessary to change the WebDB writer
interface a little bit. The previous writer model is kind of
"symmetric" with no controlling node. So I added a
MapredWebDBCommitter to run on this master. A typical write process
would be:
 
 1. Master creates a MapredWebDBCommitter and gets a commitID from it.
 2. Master calls MapReduce to do its work specific to the tool, e.g.
reaping fetch results for "updatedb". In this process, each worker
will create a MapredWebDBWriter using the commitID passed to it and
append edits.
 3. Each worker closes its MapredWebDBWriter. The writer in turn closes
the edits files but does not alter the database itself. This is
different from before.
 4. After all tool-specific work is done, the master calls
committer.commit(). This does the real work of regenerating the
tables, using MapReduce of course.
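
To make the committer/writer split a bit more concrete, here is a very
rough Java sketch of the two interfaces as I currently picture them.
Only the class names MapredWebDBCommitter and MapredWebDBWriter, the
commitID, and commit() correspond to what I described above; everything
else (getCommitId, WebDBEdit, append) is a placeholder for discussion,
not my actual code.

  import java.io.IOException;

  /** Placeholder for an edit record (PageInstruction and friends). */
  interface WebDBEdit {}

  /** Lives on the master node. */
  interface MapredWebDBCommitter {
      /** Step 1: hand out a commitID identifying this batch of edits. */
      String getCommitId();

      /** Step 4: after all tool-specific MapReduce work is done,
       *  regenerate the tables, itself as one or more MapReduce jobs. */
      void commit() throws IOException;
  }

  /** Created by each worker with the commitID passed to it. */
  interface MapredWebDBWriter {
      /** Step 2: append an edit; nothing is applied to the DB yet. */
      void append(WebDBEdit edit) throws IOException;

      /** Step 3: finish the edits files; the database itself is left
       *  untouched until the master calls commit(). */
      void close() throws IOException;
  }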

As with DistributedWebDBWriter, the shutdown/commit process is done in
stages, each of which roughly generates one new table. However, each stage is
now one or more MapReduce jobs. Here's how I did the first stage, i.e.
generating the PagesByURL table and some other edits.

 x The input for mapping is simply all edits (PageInstruction). Thus a
"seq" input format is used to read the edits files. A custom
partitioner class is used to send all edits destined for a specific DB
section to the same reducer, indexed by the DB section number (a rough
sketch of the partitioner follows this list).
 x Each reducer then opens the corresponding DB section and merges it
with the edits fed to the reduce() method. The resulting entries are
written into a new table file, and new edits are written to edit
files, one per reduce task. These are used as input to the next
stage's MapReduce (also sketched after the list).
 x The specific benefits of using MapReduce here are easy
partitioning, automatic sorting and data transfer, plus the usual
failure handling and so on.
 x One problem with this is that the number of reduce tasks must be
the same as the number of DB sections. An alternative design (which
could be more scalable) would be to use another MapReduce job to
shuffle out all entries in the DB.
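
To illustrate the partitioning, here is roughly what I mean by the
custom partitioner. The names (SectionPartitioner, EditKey,
getSectionNumber) are invented for this sketch, and I've left out the
actual Partitioner interface from the mapred package, so treat the
signature as illustrative only:

  /** Illustrative only: route every edit to the reducer that owns its
   *  DB section.  In the real code this implements the Partitioner
   *  interface of the mapred package; EditKey is a stand-in for
   *  whatever key type carries the target section number. */
  class SectionPartitioner {
      /** Assumes one reduce task per DB section (the caveat above). */
      int getPartition(EditKey key, int numReduceTasks) {
          return key.getSectionNumber() % numReduceTasks;
      }
  }

  /** Stand-in key type for the sketch. */
  interface EditKey {
      int getSectionNumber();
  }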
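The reduce side, with the same caveat that every name here
(SectionReducer, SectionStore, Edit, Entry, and so on) is invented for
the sketch and the details follow the existing WebDBWriter logic,
looks roughly like this:

  import java.io.IOException;
  import java.util.Iterator;

  /** Shape of the stage-1 reduce step; one reduce task per DB section. */
  class SectionReducer {
      private final SectionStore oldSection;   // existing sorted section
      private final SectionStore newSection;   // new table file being built
      private final EditFile nextStageEdits;   // input for the next stage

      SectionReducer(SectionStore oldSection, SectionStore newSection,
                     EditFile nextStageEdits) {
          this.oldSection = oldSection;
          this.newSection = newSection;
          this.nextStageEdits = nextStageEdits;
      }

      /** Called once per URL with all of its edits, already sorted and
       *  grouped by the MapReduce framework. */
      void reduce(String url, Iterator<Edit> edits) throws IOException {
          // Sequentially scan the old sorted section up to this URL,
          // copying unchanged entries straight into the new table, and
          // return the existing entry for this URL, if any.
          Entry current = oldSection.advanceTo(url, newSection);
          while (edits.hasNext()) {
              Edit edit = edits.next();
              current = edit.applyTo(current);   // merge one instruction
              // The "other edits" for later stages; exactly which derived
              // edits get written is elided here.
              nextStageEdits.append(edit.derivedEdit());
          }
          if (current != null) {
              newSection.write(url, current);    // entry for the new table
          }
      }
  }

  // Stand-ins for the sketch only:
  interface Entry {}
  interface Edit { Entry applyTo(Entry current); Edit derivedEdit(); }
  interface EditFile { void append(Edit e) throws IOException; }
  interface SectionStore {
      Entry advanceTo(String url, SectionStore copyUnchangedTo)
          throws IOException;
      void write(String url, Entry e) throws IOException;
  }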

Does this make sense? This seems to be working fine for the first
stage (I've run some trivial tests). But I haven't really looked into
the later stages yet. I'll be happy to post the code if people are
interested.

Comments?

Regards,
- Feng Zhou
Grad student, CS, UC Berkeley
