Gaurav Agarwal wrote:
Hi Andrzej,
Thanks a lot for pointing out the features to me. I greatly appreciate the
help. Things look a lot better now :)
Just one more thing: can you point me to any document/email/discussion
(internal or published) that gives some insight into the
architecture of Nutch 0.8.x, and maybe some information on the kind of data
that goes in every directory?
If the Wiki doesn't already contain this info (I didn't check), then only the
mailing lists may contain it ... though most of the stuff is the same as
before, and the basic work cycle is unchanged. Data formats differ, e.g. webdb
was split into two parts (crawldb and linkdb), outlinks are stored in
crawl_parse (and in parse_data), and there are those funky part-xxxx
subdirectories, which are a side effect of using Hadoop. Other than that,
not much changed in the data layout.
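
To see which data ends up where, one option is to walk a crawl directory and
list the part-xxxx entries under each data directory. What follows is only a
rough sketch, not anything shipped with Nutch: the function name
find_part_dirs is made up for illustration, and it assumes the part entries
are named "part-" followed by digits and may be either directories or plain
files.

```python
import os
import re

# Assumption: Hadoop output entries are named "part-" followed by digits.
PART_RE = re.compile(r"^part-\d+$")


def find_part_dirs(root):
    """Map each directory under root (as a path relative to root) to the
    sorted list of part-NNNNN entries found directly inside it."""
    layout = {}
    for dirpath, dirnames, filenames in os.walk(root):
        parts = sorted(n for n in dirnames + filenames if PART_RE.match(n))
        if parts:
            layout[os.path.relpath(dirpath, root)] = parts
            # Do not descend into the part-NNNNN directories themselves.
            dirnames[:] = [d for d in dirnames if not PART_RE.match(d)]
    return layout
```

Pointing find_part_dirs at the top-level crawl directory returns a dict with
one key per data directory (for example a segment's crawl_parse or
parse_data) and the part entries that directory holds.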
When it comes to the architecture, it was completely rewritten - I don't
think there's any detailed documentation on this, though...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com