Re: [Nutch-dev] Re: MapReduce WebDB writer

Stefan Groschupf Mon, 28 Mar 2005 04:58:55 -0800

Hi Feng,

After reading your code, I think I understand your idea better and better. Sorry for being so slow. :-) To summarize in my own words: tools write edits tagged with a commit id, and the edits are reduced via MapReduce and merged into the webdb. Right?
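To make that summary concrete, here is a minimal plain-Java sketch of the flow (edits tagged with a commit id, reduced, then merged into the webdb). It is an illustration only, not Feng's code and not the Nutch API: the names EditMergeSketch, Edit, reduce and merge are made up, the db is modeled as a sorted map from URL to a string, and the rule "last edit per URL wins" is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: every name here is hypothetical, not the Nutch WebDB API.
class EditMergeSketch {

    /** One edit written by a tool, tagged with the commit it belongs to. */
    static final class Edit {
        final long commitId;
        final String url;
        final boolean delete;  // true removes the page, false adds or replaces it
        final String page;     // stand-in for the real page record; unused for deletes

        Edit(long commitId, String url, boolean delete, String page) {
            this.commitId = commitId;
            this.url = url;
            this.delete = delete;
            this.page = page;
        }
    }

    /** Reduce: keep the edits of one commit; per URL the last edit in input order wins (assumed rule). */
    static Map<String, Edit> reduce(List<Edit> edits, long commitId) {
        Map<String, Edit> lastPerUrl = new TreeMap<String, Edit>();
        for (Edit e : edits) {
            if (e.commitId == commitId) {
                lastPerUrl.put(e.url, e);
            }
        }
        return lastPerUrl;
    }

    /** Merge: apply the reduced edits to a copy of the db, which stays sorted by URL. */
    static TreeMap<String, String> merge(Map<String, String> db, Map<String, Edit> reduced) {
        TreeMap<String, String> merged = new TreeMap<String, String>(db);
        for (Edit e : reduced.values()) {
            if (e.delete) {
                merged.remove(e.url);
            } else {
                merged.put(e.url, e.page);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> db = new TreeMap<String, String>();
        db.put("http://a.example/", "old");
        db.put("http://b.example/", "keep");

        List<Edit> edits = Arrays.asList(
            new Edit(1L, "http://a.example/", false, "new"),
            new Edit(1L, "http://c.example/", false, "added"),
            new Edit(2L, "http://b.example/", true, null)); // other commit, ignored here

        System.out.println(merge(db, reduce(edits, 1L)));
    }
}
```

In this sketch the commit id only selects which edits take part in a merge; edits from another commit stay untouched until that commit is merged.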

However, wouldn't it be useful to have tools write small, independent edit files that can be merged with the webdb at any time? That would couple the tools and the webdb less tightly. The disk space needed is the same either way.

Anyway, as far as my limited knowledge goes, your code makes a lot of sense. I suggest not changing the webdb interface, but duplicating it and changing the copy. I don't think breaking the code of the existing tools is a good idea.

Stefan


On 26.03.2005 at 23:55, Feng Zhou wrote:

Hi Stefan,

I've posted the code at
http://security-gate.cs.berkeley.edu/~zf/nutch/mrdb-test.tar.gz. It
won't compile because I changed a few other bits of the mapreduce
code. But it should be enough for explanatory purposes.

In general I do not clearly understand the idea behind a "master" and
the MapredWebDBCommitter.
Isn't this handled by the jobtracker and the job itself?
If you browse the Grep job, you can see that it contains both a
grepJob and a sortJob, so you are able to manage 'flows' in the
job itself.


By "master" I mean the node starting the MapReduce process, i.e.
calling JobClient.runJob(). Sorry I didn't explain it (it's from the
mapreduce paper). The reason to add another class is that both the
master and the workers need a way to reference the generic webdb
writer. In its current form, the master accesses the committer and
the workers access their respective writers. Certainly this breaks
the IWebDB contract, but it still seems close enough.

* create an inputformat for the segment file(s).
* write a mapper that creates several small unsorted webdbs.
* write a combiner that merges these small webdbs with the existing webdb into a temp webdb.
* write a reducer that is able to sort and merge the entries of the temp webdb.
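The steps listed above can be sketched in plain Java, without any MapReduce classes, just to show the data flow. This is a simplified illustration under stated assumptions: a db entry is reduced to a URL string, the inputformat step is replaced by an in-memory list of fetched URLs, and all names (ProposedPipelineSketch, mapToSmallDbs, combine, reduce) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustration only: hypothetical names, no Nutch or MapReduce classes involved.
class ProposedPipelineSketch {

    /** "Mapper": split the fetched URLs into small unsorted dbs of at most chunkSize entries. */
    static List<List<String>> mapToSmallDbs(List<String> fetchedUrls, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> smallDbs = new ArrayList<List<String>>();
        for (int i = 0; i < fetchedUrls.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, fetchedUrls.size());
            smallDbs.add(new ArrayList<String>(fetchedUrls.subList(i, end)));
        }
        return smallDbs;
    }

    /** "Combiner": put the existing db and all small dbs into one unsorted temp db. */
    static List<String> combine(List<String> existingDb, List<List<String>> smallDbs) {
        List<String> tempDb = new ArrayList<String>(existingDb);
        for (List<String> small : smallDbs) {
            tempDb.addAll(small);
        }
        return tempDb;
    }

    /** "Reducer": sort the temp db and drop duplicate entries. */
    static List<String> reduce(List<String> tempDb) {
        return new ArrayList<String>(new TreeSet<String>(tempDb));
    }

    public static void main(String[] args) {
        List<String> existingDb = Arrays.asList("http://b.example/", "http://d.example/");
        List<String> fetched = Arrays.asList(
            "http://c.example/", "http://a.example/", "http://b.example/");

        List<List<String>> smallDbs = mapToSmallDbs(fetched, 2);
        System.out.println(reduce(combine(existingDb, smallDbs)));
    }
}
```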


To understand you better: the "segment files" that the inputformat
reads refer to fetch results, right? If so, you are referring to what
the "updatedb" tool will do, right? I'm thinking a little bit
differently, by keeping as much of the current WebDBWriter interface
as possible. That is, the tool will not read/write the DB all by
itself. It will still call methods like dbwriter.addPage() to write to
the DB. This way you don't have to write the whole MapReduce process
all over again to do another kind of mutation of the DB. Apart from
that difference, my code kind of does the same thing, although I
didn't use a combiner. All merging work is done in reduction.

- Feng


As mentioned, maybe I missed something, but since the job itself is a
kind of master, the processes can be managed from the job.
Since all files would be written to a unique ndf folder, it may not
be necessary to have any kind of id.

Anyway I would love to see the code you mentioned to understand your
ideas.

Stefan


---------------------------------------------------------------
company:                http://www.media-style.com
forum:          http://www.text-mining.org
blog:                   http://www.find23.net
