Hi, I have set up Nutch 1.0 on a cluster of 3 nodes.
We are running two applications:

1. A Nutch-based search application. We have successfully crawled approx. 25M pages on the 3 nodes, and it is working as expected.

2. An application that extracts information for a particular domain. As of today this application uses a Heritrix-based crawler: it crawls the given domain, and extraction algorithms then go through the pages and pull out the required information.

Since we are already crawling with Nutch in distributed mode, we don't want to recrawl with another tool like Heritrix for the 2nd application; I want to reuse the same crawled data for it. However, the extraction algorithms need all the crawled pages for a particular domain in order to extract all the relevant information about that domain.

My thought is that if I can write a Nutch plugin that feeds the Nutch-crawled data into the 2nd application, it will save us the work, money and effort of recrawling. But how do I get all the crawled pages for a particular domain in my plugin? Where should I look in the Nutch code? Any pointer / idea in this direction will really help.

Thanks,
Bhavin
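To be concrete, the per-domain selection I have in mind boils down to checking each crawled record's URL against the target domain. Here is a minimal, self-contained sketch of just that check (the Nutch-specific part of reading segment records is omitted, and the helper name `sameDomain` is my own, not a Nutch API):

```java
import java.net.URI;

public class DomainFilter {

    // Hypothetical helper: returns true when the URL's host is the
    // target domain itself or a subdomain of it, e.g.
    // "news.example.com" matches domain "example.com".
    static boolean sameDomain(String url, String domain) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            return host.equals(domain) || host.endsWith("." + domain);
        } catch (Exception e) {
            // Malformed URL: skip the record rather than fail the job.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameDomain("http://news.example.com/a.html", "example.com")); // true
        System.out.println(sameDomain("http://example.org/b.html", "example.com"));      // false
    }
}
```

The idea would be to iterate over the fetched content in the segment directories (Nutch's SegmentReader, exposed as `bin/nutch readseg`, looks like a starting point for that) and apply a check like the one above to each record's URL, handing the matches to the 2nd application. Whether a plugin or a standalone job over the segments is the better place for this is exactly what I'm unsure about.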
