Hi all, I've been reading this list for some time, and Nutch is really a project with great potential. I've played with the MapReduce implementation in Nutch for a while and written some toy code. I'm very interested in helping rewrite some of the distributed algorithms in MapReduce.
Actually, I have already partially implemented a MapReduce WebDB writer. I don't know whether anyone else is working on this right now, but that's OK anyway. It uses basically the same algorithm as Mike Cafarella's DistributedWebDBWriter. Here are some issues I'd like to discuss. As I'm still new to Nutch, I could be wrong.

The overall approach is based on the assumption that tools using the new MapReduce-based WebDB writer will be using MapReduce themselves too, so there will be a single "master" node controlling the whole process. This, I think, makes it necessary to change the WebDB writer interface a little bit; the previous writer model is kind of "symmetric", with no controlling node. So I added a MapredWebDBCommitter to run on this master. A typical write process would be (there is a rough sketch of this flow at the end of the mail):

1. The master creates a MapredWebDBCommitter and gets a commitID from it.
2. The master calls MapReduce to do the work specific to the tool, e.g. reaping fetch results for "updatedb". In this process, each worker creates a MapredWebDBWriter using the commitID passed to it and appends edits.
3. Workers close their MapredWebDBWriter. The writer in turn closes the edits files but does not alter the database itself. This is different from before.
4. After all tool-specific work is done, the master calls committer.commit(). This does the real work of regenerating the tables, using MapReduce of course.

As with DistributedWebDBWriter, the shutdown/commit process is done in stages, each of which roughly generates one new table. However, each stage is now one or more MapReduce jobs. Here's how I did the first stage, i.e. generating the PagesByURL table and some other edits:

x The input for mapping is simply all edits (PageInstruction), so a "seq" input format is used to read the edits files. A custom partitioner class is used to send all edits for a specific DB section to the same reducer, indexed by the DB number (see the second sketch at the end of the mail).
x Each reducer then opens the corresponding DB section and merges it with each edit fed to the reduce() method. The resulting entries are written into a new table file, and new edits are written to edits files, one per reduce task. These are used as input to the next stage's MapReduce.
x The specific benefits of using MapReduce here are easy partitioning, automatic sorting and data transfer, plus the usual failure handling.
x One problem with this is that the number of reduce tasks must be the same as the number of DB sections. An alternative design (which could be more scalable) would be to use another MapReduce job to shuffle out all entries in the DB. Does this make sense?

This seems to be working fine for the first stage (I've run some trivial tests), but I haven't really looked into the later stages yet. I've appended a couple of rough sketches below, and I'll be happy to post the full code if people are interested. Comments?

Regards,
- Feng Zhou
Grad student, CS, UC Berkeley
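
Here is a rough sketch of the write protocol described above. The class names (MapredWebDBCommitter, MapredWebDBWriter) are what I have in my code, but the method names and the stubbed-out bodies below are only placeholders to make the flow concrete, not the actual interface:

    // Rough sketch of the write protocol above.  The class names match what I
    // have now; the method names and stubbed bodies are just placeholders to
    // make the flow concrete, not the real interface.

    import java.io.IOException;

    public class WriteProtocolSketch {

      static class MapredWebDBCommitter {
        MapredWebDBCommitter(String dbDir) { /* open the db, start a commit */ }
        String getCommitID() { return "commit-0001"; }        // placeholder
        void commit() throws IOException {
          // Runs the staged MapReduce jobs that regenerate the tables
          // (pagesByURL first, then the other tables).
        }
      }

      static class MapredWebDBWriter {
        MapredWebDBWriter(String dbDir, String commitID) { /* open edits files */ }
        void addPage(Object page) { /* append an edit, nothing more */ }
        void close() { /* close the edits files; the db itself is untouched */ }
      }

      public static void main(String[] args) throws IOException {
        String dbDir = "db";

        // Step 1: the master creates the committer and gets a commitID.
        MapredWebDBCommitter committer = new MapredWebDBCommitter(dbDir);
        String commitID = committer.getCommitID();

        // Steps 2-3: these really run inside the tool's own MapReduce tasks;
        // each worker creates its own writer with the commitID and appends edits.
        MapredWebDBWriter writer = new MapredWebDBWriter(dbDir, commitID);
        writer.addPage(new Object());     // e.g. a fetched page for "updatedb"
        writer.close();                   // only the edits files are closed

        // Step 4: the master regenerates the tables (one or more MapReduce
        // jobs per stage).
        committer.commit();
      }
    }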
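
And here is a tiny sketch of the partitioning idea for the first stage. I'm glossing over how an edit's DB section number is obtained (in this sketch I just store it on the edit; in the real code it would come from however the db sections are keyed), so the Edit class below is only a stand-in for PageInstruction:

    // Sketch of the first-stage partitioning: all edits for one db section go
    // to the same reduce task, so the partition index is just the section
    // number.  This is also why the number of reduce tasks has to equal the
    // number of db sections in this design.

    public class EditsBySectionPartitioner {

      static class Edit {
        final String url;
        final int dbSection;     // which db section this edit targets (assumed
                                 // to be known here; see the note above)
        Edit(String url, int dbSection) {
          this.url = url;
          this.dbSection = dbSection;
        }
      }

      // Same shape as a mapred partitioner's getPartition(), written standalone
      // so the sketch does not depend on the exact interface in the mapred code.
      static int getPartition(Edit edit, int numReduceTasks) {
        return edit.dbSection % numReduceTasks;   // one reduce task per section
      }

      public static void main(String[] args) {
        int numSections = 4;     // == number of reduce tasks in this design
        Edit[] edits = {
          new Edit("http://a.example/", 2),
          new Edit("http://b.example/", 0),
          new Edit("http://c.example/", 2),
        };
        for (Edit e : edits) {
          System.out.println(e.url + " -> reducer " + getPartition(e, numSections));
        }
        // Each reducer then opens its own db section, merges in the edits it
        // receives via reduce(), writes the new table file, and emits any new
        // edits as input to the next stage's MapReduce.
      }
    }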
