Gaurav Agarwal wrote:
> Hi everyone,
> Definitely the advantage with 0.8.x is that it models almost every
> operation as a map-reduce job (which is amazing!), and is therefore much
> more scalable; but in the absence of the APIs mentioned above it does not
> help me much in building the web-link graph from the crawler output.

There is a similar API for reading from the crawl DB, called CrawlDbReader. It is relatively simple compared to the old WebDBReader, because most of the support is already provided by Hadoop (i.e. the map-reduce framework).
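
If you want programmatic access rather than the command-line tool, a minimal sketch along these lines should work. Note the part-file layout and the Text key class are my assumptions (very early 0.8 builds used UTF8 keys), and a real lookup should check every part-* file the way CrawlDbReader itself does:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // The current version of the crawldb lives under <crawldb>/current/part-*;
    // with more than one part file the URL may be in any of them.
    Path part = new Path(args[0], "current/part-00000");
    MapFile.Reader reader = new MapFile.Reader(fs, part.toString(), conf);
    Text key = new Text(args[1]);        // the page URL
    CrawlDatum value = new CrawlDatum();
    if (reader.get(key, value) != null) {
      // prints status, fetch time, retry count, score etc.
      System.out.println(key + "\t" + value);
    } else {
      System.out.println("URL not found in this part file.");
    }
    reader.close();
  }
}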

In 0.8 and later the information about pages and the information about links is split into two separate DBs - crawldb and linkdb - but exactly the same information can be obtained from them as before.



> I may be completely wrong here, and please correct me if I am, but it
> looks like post the 0.8.0 release the thrust has been to develop the
> Nutch project entirely as an indexing library/application, with the crawl
> module itself losing its independence and decoupling. With 0.8.x, the
> crawl output by itself does not give much useful information (or at least
> I failed to locate such APIs).


That's not the case - if anything, the amount of useful information you can retrieve has increased tremendously. Please see all the tools available through the bin/nutch script that are prefixed with read*, and then look at their implementation for inspiration.
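
Off the top of my head (run bin/nutch with no arguments to see the exact usage for your release), the relevant commands look like:

  bin/nutch readdb <crawldb> (-stats | -dump <out_dir> | -url <url>)
  bin/nutch readlinkdb <linkdb> (-dump <out_dir> | -url <url>)
  bin/nutch readseg (-dump <segment> <out_dir> | -list <segment> | -get <segment> <url>)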



> I'll rephrase my concerns as concrete questions:

> 1) Is there a way (APIs) in the 0.8.x/0.9 releases of Nutch to access
> information about crawled data, like: get all pages (contents) given a

Fetched pages are stored in segments. Please see the SegmentReader tool, which allows you to retrieve the segment content.
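
For example (the segment name below is made up, and the option names are from memory - check the usage output of bin/nutch readseg):

  # dump a whole segment (content, parse text, parse data) as plain text
  bin/nutch readseg -dump crawl/segments/20070501123456 dump_dir

  # retrieve the stored records for a single URL
  bin/nutch readseg -get crawl/segments/20070501123456 http://www.example.com/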


> URL/md5, get outgoing links from a URL, and get all incoming links to a

SegmentReader as above - outgoing links are recorded in the segment's parse_data. For incoming links use the linkdb and the LinkDbReader tool.
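
If you need this from Java rather than from the command line, something like the sketch below should do. The constructor signature has moved around between releases (some builds take a FileSystem argument), so please verify against your source tree:

import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.util.NutchConfiguration;

public class InlinkDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // args[0] = path to the linkdb, args[1] = page URL
    LinkDbReader linkDb = new LinkDbReader(conf, new Path(args[0]));
    Inlinks inlinks = linkDb.getInlinks(new Text(args[1]));
    if (inlinks == null) {
      System.out.println("no incoming links recorded");
      return;
    }
    // each Inlink carries the source URL and the anchor text of the link
    for (Iterator it = inlinks.iterator(); it.hasNext();) {
      Inlink in = (Inlink) it.next();
      System.out.println(in.getFromUrl() + "\t" + in.getAnchor());
    }
  }
}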

> URL (this last API is provided; I mentioned it for the sake of
> completeness). Or is there an easy way I can improvise these APIs?

> 2) If the answer to 1 is NO, are there any plans to add this
> functionality back in forthcoming releases?

> 3) If the answer to both 1 and 2 is NO, can someone point me to the
> discussions that explain the rationale behind making these changes to the
> interface, which (in my opinion) leave the crawler module slightly
> weakened? (I tried scanning the forum posts back to the era when 0.7.2
> was released, but failed to locate any such discussion.)

Please see above. The answer is yes. ;)

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
