Re: 0.8.x Crawler compared to 0.7.2 Crawler

Andrzej Bialecki Wed, 28 Mar 2007 12:42:57 -0800

Gaurav Agarwal wrote:

Hi Andrzej,


Thanks a lot for pointing out the features to me. I greatly appreciate the
help. Things look a lot better now :)

Just one more thing: can you point me to any document/email/discussion
(internal or published) which can give me some insight into the
architecture of Nutch 0.8.x, and maybe information on the kind of data
that goes into each directory?

If Wiki doesn't already contain this info (I didn't check) then only the
mailing lists may contain it ... though most of the stuff is the same,
the basic work cycle is still the same. Data formats differ, e.g. webdb
was split into two parts, outlinks are stored in crawl_parse (and in
parse_data), and there are those funky part-xxxx subdirectories, which
are a side-effect of using Hadoop. Other than that not much changed in
the data layout.
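
To make the layout described above a bit more concrete, here is a minimal sketch in Python (not part of Nutch) that reports which directories under a crawl directory contain part-xxxx subdirectories. The function name find_part_dirs, the example path "crawl", and the assumption that part names are "part-" followed by digits are all illustrative; the sketch only reads directory names and makes no claim about the data stored inside them.

```python
import re
from pathlib import Path

# Assumed naming pattern for the Hadoop output parts: "part-" plus digits.
PART_NAME = re.compile(r"^part-\d+$")


def find_part_dirs(root):
    """Map each directory under root to the sorted names of its part-xxxx subdirectories."""
    root = Path(root)
    layout = {}
    for path in root.rglob("part-*"):
        # Keep only directories whose name matches the assumed pattern.
        if path.is_dir() and PART_NAME.match(path.name):
            parent = path.parent.relative_to(root).as_posix()
            layout.setdefault(parent, []).append(path.name)
    return {parent: sorted(names) for parent, names in layout.items()}


if __name__ == "__main__":
    # "crawl" is a hypothetical path; point it at your own crawl directory.
    for parent, parts in sorted(find_part_dirs("crawl").items()):
        print(parent, parts)
```

Run against a real crawl directory, this should show at a glance where directories such as crawl_parse and parse_data are split into parts.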

When it comes to the architecture, it was completely rewritten - I don't
think there's any detailed documentation on this, though...



--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
