Gaurav Agarwal wrote:
> Hi everyone,
> Definitely the advantage with 0.8.x is that it models almost every
> operation as a map-reduce job (which is amazing!), and is therefore much
> more scalable; but in the absence of the APIs mentioned above it does not
> help me much in building the web-link graph from the crawler output.

There is a similar API for reading from the crawl DB, called CrawlDbReader. It is relatively simple compared to the old WebDBReader, because most of the support is already provided by Hadoop (i.e. the map-reduce framework).
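
If you want programmatic access rather than the command-line tool, a minimal sketch along these lines should work. Note the part-file layout and the Text key class are my assumptions (very early 0.8 builds used UTF8 keys), and a real lookup should check every part-* file the way CrawlDbReader itself does:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // The current version of the crawldb lives under <crawldb>/current/part-*;
    // with more than one part file the URL may be in any of them.
    Path part = new Path(args[0], "current/part-00000");
    MapFile.Reader reader = new MapFile.Reader(fs, part.toString(), conf);
    Text key = new Text(args[1]);        // the page URL
    CrawlDatum value = new CrawlDatum();
    if (reader.get(key, value) != null) {
      // prints status, fetch time, retry count, score etc.
      System.out.println(key + "\t" + value);
    } else {
      System.out.println("URL not found in this part file.");
    }
    reader.close();
  }
}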

In 0.8 and later the information about pages and the information about links is split into two separate DBs - crawldb and linkdb - but exactly the same information can be obtained from them as before.



> I may be completely wrong here, and please correct me if I am, but it
> looks like post the 0.8.0 release the thrust has been to develop the
> Nutch project entirely as an indexing library/application, with the crawl
> module itself losing its independence and decoupling. With 0.8.x, the
> crawl output by itself does not give much useful information (or at least
> I failed to locate such APIs).


That's not the case - if anything, the amount of useful information you can retrieve has increased tremendously. Please see all the tools available through the bin/nutch script that are prefixed with read*, and then look at their implementation for inspiration.
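
Off the top of my head (run bin/nutch with no arguments to see the exact usage for your release), the relevant commands look like:

  bin/nutch readdb <crawldb> (-stats | -dump <out_dir> | -url <url>)
  bin/nutch readlinkdb <linkdb> (-dump <out_dir> | -url <url>)
  bin/nutch readseg (-dump <segment> <out_dir> | -list <segment> | -get <segment> <url>)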



> I'll rephrase my concerns as concrete questions:

> 1) Is there a way (APIs) in the 0.8.x/0.9 releases of Nutch to access
> information about crawled data, like: get all pages (contents) given a

Fetched pages are stored in segments. Please see the SegmentReader tool, which allows you to retrieve the segment content.
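
For example (the segment name below is made up, and the option names are from memory - check the usage output of bin/nutch readseg):

  # dump a whole segment (content, parse text, parse data) as plain text
  bin/nutch readseg -dump crawl/segments/20070501123456 dump_dir

  # retrieve the stored records for a single URL
  bin/nutch readseg -get crawl/segments/20070501123456 http://www.example.com/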


> URL/md5, get outgoing links from a URL, and get all incoming links to a

SegmentReader as above - outgoing links are recorded in the segment's parse_data. For incoming links use the linkdb and the LinkDbReader tool.
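
If you need this from Java rather than from the command line, something like the sketch below should do. The constructor signature has moved around between releases (some builds take a FileSystem argument), so please verify against your source tree:

import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.util.NutchConfiguration;

public class InlinkDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // args[0] = path to the linkdb, args[1] = page URL
    LinkDbReader linkDb = new LinkDbReader(conf, new Path(args[0]));
    Inlinks inlinks = linkDb.getInlinks(new Text(args[1]));
    if (inlinks == null) {
      System.out.println("no incoming links recorded");
      return;
    }
    // each Inlink carries the source URL and the anchor text of the link
    for (Iterator it = inlinks.iterator(); it.hasNext();) {
      Inlink in = (Inlink) it.next();
      System.out.println(in.getFromUrl() + "\t" + in.getAnchor());
    }
  }
}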

> URL (this last API is provided; I mentioned it for the sake of
> completeness). Or is there an easy way I can improvise these APIs?

> 2) If the answer to 1 is NO, are there any plans to add this
> functionality back in forthcoming releases?

> 3) If the answer to both 1 and 2 is NO, can someone point me to the
> discussions that explain the rationale behind making these changes to the
> interface, which (in my opinion) leave the crawler module slightly
> weakened? (I tried scanning the forum posts back to the era when 0.7.2
> was released, but failed to locate any such discussion.)

Please see above. The answer is yes. ;)

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
