Background of my problem: I am running Nutch 1.4 on Hadoop 0.20.203. There is a series of MapReduce jobs that I run on the Nutch segments to produce the final output, but waiting for the whole crawl to finish before running those MapReduce jobs makes the solution take much longer. So I now trigger the MapReduce jobs on each segment as soon as it is dumped, by running the crawl in a loop ('N = depth' times) with depth=1. The problem: some URLs get lost when I crawl with depth 1 in a loop N times, compared to a single crawl with depth N.
Please find pseudo code for both cases below.

*Case 1*: Nutch crawl on Hadoop with depth=3.

    // Create the list of arguments which we are going to pass to Nutch
    List<String> nutchArgsList = new ArrayList<String>();
    nutchArgsList.add("-depth");
    nutchArgsList.add(Integer.toString(3));
    <...other nutch args...>
    ToolRunner.run(nutchConf, new Crawl(),
        nutchArgsList.toArray(new String[nutchArgsList.size()]));

*Case 2*: Crawling in a loop 3 times with depth=1.

    for (int depthRun = 0; depthRun < 3; depthRun++) {
        // Create the list of arguments which we are going to pass to Nutch
        List<String> nutchArgsList = new ArrayList<String>();
        nutchArgsList.add("-depth");
        nutchArgsList.add(Integer.toString(1)); // *NOTE*: depth is 1 here
        <...other nutch args...>
        ToolRunner.run(nutchConf, new Crawl(),
            nutchArgsList.toArray(new String[nutchArgsList.size()]));
    }

Some URLs get lost (db_unfetched) when I crawl in a loop as many times as the depth. I have tried this on standalone Nutch, running once with depth 3 versus running 3 times over the same URLs with depth 1. Comparing the crawldbs, the difference is only 12 URLs. But when I do the same on Hadoop using ToolRunner, I get about 1000 URLs as db_unfetched. As far as I understand, Nutch simply triggers the crawl in a loop as many times as the depth value, so the two cases should behave the same.

Please suggest what could be going wrong. Also, please let me know why the difference is so large when I do this on Hadoop using ToolRunner versus on standalone Nutch.

Thanks in advance.

Regards,
Ashish V
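P.S. For reference, here is a minimal sketch of what I understand the Crawl tool to be doing internally: inject the seeds once, then run generate / fetch / parse / updatedb once per depth iteration. This assumes the individual Nutch steps (Injector, Generator, Fetcher, ParseSegment, CrawlDb) can each be driven through ToolRunner the same way the bin/nutch commands drive them; the paths ("crawl/crawldb", "crawl/segments", "urls") and the StepwiseCrawl class name are placeholders, not my actual code.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.crawl.CrawlDb;
    import org.apache.nutch.crawl.Generator;
    import org.apache.nutch.crawl.Injector;
    import org.apache.nutch.fetcher.Fetcher;
    import org.apache.nutch.parse.ParseSegment;
    import org.apache.nutch.util.NutchConfiguration;

    public class StepwiseCrawl {
        public static void main(String[] args) throws Exception {
            Configuration conf = NutchConfiguration.create();

            // Placeholder paths -- not my real setup
            String crawlDb = "crawl/crawldb";
            String segmentsDir = "crawl/segments";
            String seedDir = "urls";
            int depth = 3;

            // Seed URLs are injected once, before the loop
            ToolRunner.run(conf, new Injector(), new String[] { crawlDb, seedDir });

            for (int i = 0; i < depth; i++) {
                // Generate a new segment from the crawldb
                ToolRunner.run(conf, new Generator(), new String[] { crawlDb, segmentsDir });

                // Pick up the segment that was just generated: segment
                // directories are named by timestamp, so the largest
                // name under the segments dir is the newest one
                FileSystem fs = FileSystem.get(conf);
                FileStatus[] segs = fs.listStatus(new Path(segmentsDir));
                Path segment = segs[0].getPath();
                for (FileStatus s : segs) {
                    if (s.getPath().getName().compareTo(segment.getName()) > 0) {
                        segment = s.getPath();
                    }
                }

                // Fetch and parse the segment (assuming fetcher.parse=false),
                // then fold its output back into the crawldb. This is the
                // point where I kick off my own MapReduce jobs on the
                // freshly dumped segment.
                ToolRunner.run(conf, new Fetcher(), new String[] { segment.toString() });
                ToolRunner.run(conf, new ParseSegment(), new String[] { segment.toString() });
                ToolRunner.run(conf, new CrawlDb(), new String[] { crawlDb, segment.toString() });
            }
        }
    }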