Hi Stefan,

I've posted the code at http://security-gate.cs.berkeley.edu/~zf/nutch/mrdb-test.tar.gz. It won't compile because I changed a few other bits of the MapReduce code, but it should be enough for explanatory purposes.
> In general I do not clearly understand the idea behind a "master" and
> the MapredWebDBCommitter.
> Isn't this handled by the jobtracker and the job itself?
> When browsing the Grep job then you can see that the grep job itself
> has the grepJob and sortJob, so you are able to manage 'flows' in the
> job itself.

By "master" I mean the node that starts the MapReduce process, i.e. the one calling JobClient.runJob(). Sorry I didn't explain that; the term is from the MapReduce paper. The reason for adding another class is that both the master and the workers need a way to reference the generic webdb writer. In its current form, the master accesses the committer and the workers access their respective writers. Certainly this breaks the IWebDB contract, but it still seems close enough.

> * create inputformat for the segment file(s).
> * writing a mapper that creates several small unsorted webdb's.
> * writing a combiner that merges this small webdb's with the existing
> webdb in to a temp webdb.
> * writing a reducer that is able to sort and merge the entries of the
> temp webdb.

To make sure I understand you: the "segment files" that the inputformat reads are the fetch results, right? If so, you are referring to what the "updatedb" tool does, right?

I'm thinking about it a little differently, namely keeping as much of the current WebDBWriter interface as possible. That is, the tool will not read and write the DB all by itself; it will still call methods like dbwriter.addPage() to write to the DB. This way you don't have to rewrite the whole MapReduce process for every other kind of mutation of the DB. Apart from that difference, my code does roughly the same thing as yours, although I didn't use a combiner; all the merging work is done in the reduce step. See the P.S. for two small sketches of what I mean.

- Feng

> As mentioned may be I missed something, but since the job itself is a
> kind of master the processes can be managed from the job.
> Since all files would be written in a ndf folder that is unique it is
> may not necessary to have any kind of id.
>
> Anyway I would love to see the code you mentioned to understand your
> ideas.
>
> Stefan
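
P.S. Here is a rough sketch of the master/worker split I described above. The names GenericWebDBWriter and WorkerWriter are made up for illustration, and even the MapredWebDBCommitter body here is simplified, so don't read this as the actual code in the tarball:

import java.util.ArrayList;
import java.util.List;

// Both the master and the workers program against the same writer interface.
interface GenericWebDBWriter {
    void addPage(String url);   // workers call this for each record
    void close();               // the master's committer finalizes here
}

// Worker side: buffers write instructions and hands them to MapReduce.
class WorkerWriter implements GenericWebDBWriter {
    private final List<String> buffer = new ArrayList<String>();

    public void addPage(String url) {
        buffer.add(url);        // becomes map output, keyed by URL
    }

    public void close() {
        // flush this worker's buffer as intermediate map output
    }
}

// Master side: never adds pages itself; it runs the job and commits.
class MapredWebDBCommitter implements GenericWebDBWriter {
    public void addPage(String url) {
        throw new UnsupportedOperationException("only workers add pages");
    }

    public void close() {
        // the JobClient.runJob() step: sort/merge the workers' output
        // and install the new webdb in one go
    }
}

The point is just that the master and the workers share one writer interface, but only the committer on the master side knows how to turn the workers' output into a new webdb.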
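And here is a toy illustration of keeping the WebDBWriter-style interface on top of MapReduce. There is no real MapReduce in it; a sorted map stands in for the shuffle, and the names are again invented. The idea is that addPage() emits a <url, value> pair instead of mutating the db in place, and all merging happens in reduce:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AddPageSketch {
    // stands in for the shuffle: map output grouped and sorted by key
    static Map<String, List<String>> emitted =
        new TreeMap<String, List<String>>();

    // same shape as WebDBWriter.addPage(): the tool keeps calling this,
    // but underneath we only emit a pair instead of touching the db
    static void addPage(String url, String page) {
        List<String> vals = emitted.get(url);
        if (vals == null) {
            vals = new ArrayList<String>();
            emitted.put(url, vals);
        }
        vals.add(page);
    }

    // all merging happens here, once per URL, with every instruction
    // for that URL visible at the same time
    static void reduce(String url, List<String> pages) {
        System.out.println(url + " -> " + pages.get(pages.size() - 1));
    }

    public static void main(String[] args) {
        addPage("http://a/", "v1");
        addPage("http://a/", "v2");   // duplicate key, resolved in reduce
        addPage("http://b/", "v1");
        for (Map.Entry<String, List<String>> e : emitted.entrySet()) {
            reduce(e.getKey(), e.getValue());
        }
    }
}

Running it prints one merged record per URL, which is also why I didn't need a combiner: the reduce step already sees every instruction for a given key at once.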
