Hi, I have set up Nutch 1.0 on a cluster of 3 nodes.
We are running two applications:

1. A Nutch-based search application. We have successfully crawled approx. 25M pages on the 3 nodes, and it is working as expected.

2. An application that extracts information for a particular domain. As of today this application uses a Heritrix-based crawler: it crawls the given domain, and extraction algorithms then go through the pages and pull out the required information.

Since we are already crawling with Nutch in distributed mode, we don't want to recrawl with another tool like Heritrix for the 2nd application; I want to reuse the same crawled data for it. However, the extraction algorithms need all the crawled pages for a particular domain in order to extract all the relevant information about that domain.

My thought is that if I can write a Nutch plugin that feeds the Nutch-crawled data into the 2nd application, it will save us the work, money and effort of recrawling. But how do I get all the crawled pages for a particular domain in my plugin? Where should I look in the Nutch code? Any pointer / idea in this direction will really help.

Thanks,
Bhavin
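To be concrete, the per-domain selection I have in mind boils down to checking each crawled record's URL against the target domain. Here is a minimal, self-contained sketch of just that check (the Nutch-specific part of reading segment records is omitted, and the helper name `sameDomain` is my own, not a Nutch API):

```java
import java.net.URI;

public class DomainFilter {

    // Hypothetical helper: returns true when the URL's host is the
    // target domain itself or a subdomain of it, e.g.
    // "news.example.com" matches domain "example.com".
    static boolean sameDomain(String url, String domain) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            return host.equals(domain) || host.endsWith("." + domain);
        } catch (Exception e) {
            // Malformed URL: skip the record rather than fail the job.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(sameDomain("http://news.example.com/a.html", "example.com")); // true
        System.out.println(sameDomain("http://example.org/b.html", "example.com"));      // false
    }
}
```

The idea would be to iterate over the fetched content in the segment directories (Nutch's SegmentReader, exposed as `bin/nutch readseg`, looks like a starting point for that) and apply a check like the one above to each record's URL, handing the matches to the 2nd application. Whether a plugin or a standalone job over the segments is the better place for this is exactly what I'm unsure about.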
