Hi Stefan,

I've posted the code at http://security-gate.cs.berkeley.edu/~zf/nutch/mrdb-test.tar.gz. It won't compile because I changed a few other bits of the MapReduce code, but it should be enough for explanatory purposes.
> In general I do not clearly understand the idea behind a "master" and
> the MapredWebDBCommitter.
> Isn't this handled by the jobtracker and the job itself?
> When browsing the Grep job then you can see that the grep job itself
> has the grepJob and sortJob, so you are able to manage 'flows' in the
> job itself.

By "master" I mean the node that starts the MapReduce process, i.e. the one calling JobClient.runJob(). Sorry I didn't explain that; the term is from the MapReduce paper. The reason for adding another class is that both the master and the workers need a way to reference the generic webdb writer. In its current form, the master accesses the committer and the workers access their respective writers. Certainly this breaks the IWebDB contract, but it still seems close enough.

> * create inputformat for the segment file(s).
> * writing a mapper that creates several small unsorted webdb's.
> * writing a combiner that merges this small webdb's with the existing
> webdb in to a temp webdb.
> * writing a reducer that is able to sort and merge the entries of the
> temp webdb.

To make sure I understand you: the "segment files" that the inputformat reads are the fetch results, right? If so, you are referring to what the "updatedb" tool does, right?

I'm thinking about it a little differently, namely keeping as much of the current WebDBWriter interface as possible. That is, the tool will not read and write the DB all by itself; it will still call methods like dbwriter.addPage() to write to the DB. This way you don't have to rewrite the whole MapReduce process for every other kind of mutation of the DB. Apart from that difference, my code does roughly the same thing as yours, although I didn't use a combiner; all the merging work is done in the reduce step. See the P.S. for two small sketches of what I mean.

- Feng

> As mentioned may be I missed something, but since the job itself is a
> kind of master the processes can be managed from the job.
> Since all files would be written in a ndf folder that is unique it is
> may not necessary to have any kind of id.
>
> Anyway I would love to see the code you mentioned to understand your
> ideas.
>
> Stefan
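
P.S. Here is a rough sketch of the master/worker split I described above. The names GenericWebDBWriter and WorkerWriter are made up for illustration, and even the MapredWebDBCommitter body here is simplified, so don't read this as the actual code in the tarball:

import java.util.ArrayList;
import java.util.List;

// Both the master and the workers program against the same writer interface.
interface GenericWebDBWriter {
    void addPage(String url);   // workers call this for each record
    void close();               // the master's committer finalizes here
}

// Worker side: buffers write instructions and hands them to MapReduce.
class WorkerWriter implements GenericWebDBWriter {
    private final List<String> buffer = new ArrayList<String>();

    public void addPage(String url) {
        buffer.add(url);        // becomes map output, keyed by URL
    }

    public void close() {
        // flush this worker's buffer as intermediate map output
    }
}

// Master side: never adds pages itself; it runs the job and commits.
class MapredWebDBCommitter implements GenericWebDBWriter {
    public void addPage(String url) {
        throw new UnsupportedOperationException("only workers add pages");
    }

    public void close() {
        // the JobClient.runJob() step: sort/merge the workers' output
        // and install the new webdb in one go
    }
}

The point is just that the master and the workers share one writer interface, but only the committer on the master side knows how to turn the workers' output into a new webdb.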
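And here is a toy illustration of keeping the WebDBWriter-style interface on top of MapReduce. There is no real MapReduce in it; a sorted map stands in for the shuffle, and the names are again invented. The idea is that addPage() emits a <url, value> pair instead of mutating the db in place, and all merging happens in reduce:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class AddPageSketch {
    // stands in for the shuffle: map output grouped and sorted by key
    static Map<String, List<String>> emitted =
        new TreeMap<String, List<String>>();

    // same shape as WebDBWriter.addPage(): the tool keeps calling this,
    // but underneath we only emit a pair instead of touching the db
    static void addPage(String url, String page) {
        List<String> vals = emitted.get(url);
        if (vals == null) {
            vals = new ArrayList<String>();
            emitted.put(url, vals);
        }
        vals.add(page);
    }

    // all merging happens here, once per URL, with every instruction
    // for that URL visible at the same time
    static void reduce(String url, List<String> pages) {
        System.out.println(url + " -> " + pages.get(pages.size() - 1));
    }

    public static void main(String[] args) {
        addPage("http://a/", "v1");
        addPage("http://a/", "v2");   // duplicate key, resolved in reduce
        addPage("http://b/", "v1");
        for (Map.Entry<String, List<String>> e : emitted.entrySet()) {
            reduce(e.getKey(), e.getValue());
        }
    }
}

Running it prints one merged record per URL, which is also why I didn't need a combiner: the reduce step already sees every instruction for a given key at once.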
