Background of my problem: I am running Nutch 1.4 on Hadoop 0.20.203. There
is a series of MapReduce jobs that I run on Nutch segments to get the
final output, but waiting for the whole crawl to finish before running
MapReduce makes the solution take much longer. So I now trigger the
MapReduce jobs on segments as soon as they are dumped, by running the crawl
in a loop ('N = depth' times) with depth=1 each time. The problem is that
some URLs get lost when I crawl with depth 1 in a loop N times, compared to
a single crawl with depth N.

Please find the pseudo code below:

*Case 1*: Nutch crawl on Hadoop giving depth=3.

// Create the list of arguments that we are going to pass to Nutch
List<String> nutchArgsList = new ArrayList<String>();
nutchArgsList.add("-depth");
nutchArgsList.add(Integer.toString(3));
<...other nutch args...>
ToolRunner.run(nutchConf, new Crawl(), nutchArgsList.toArray(new String[nutchArgsList.size()]));

*Case 2*: Crawling in a loop 3 times with depth=1.

for (int depthRun = 0; depthRun < 3; depthRun++) {

    // Create the list of arguments that we are going to pass to Nutch
    List<String> nutchArgsList = new ArrayList<String>();
    nutchArgsList.add("-depth");
    nutchArgsList.add(Integer.toString(1)); // *NOTE* I have given depth as 1 here
    <...other nutch args...>
    ToolRunner.run(nutchConf, new Crawl(), nutchArgsList.toArray(new String[nutchArgsList.size()]));

}

I am getting some URLs lost (db_unfetched) when I crawl in a loop as many
times as the depth.

I have tried this on standalone Nutch, running with depth 3 vs. running 3
times over the same URLs with depth 1. I have compared the crawldbs and the
difference is only 12 URLs. But when I do the same on Hadoop using
ToolRunner, I get 1000 URLs as db_unfetched.
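
For reference, something like the following sketch can be used to print the
per-status counts (db_fetched, db_unfetched, ...) of the two crawldbs, i.e.
the same numbers as "bin/nutch readdb <crawldb> -stats". This is only a
minimal sketch: I am assuming CrawlDbReader still accepts these command-line
style arguments in 1.4, and the two crawldb paths are just placeholders for
the two runs.

import org.apache.nutch.crawl.CrawlDbReader;

public class CompareCrawlDbStats {
    public static void main(String[] args) throws Exception {
        // Per-status counts (db_fetched, db_unfetched, db_gone, ...) are
        // written to the Nutch log by the -stats job.
        // Placeholder paths: point these at the crawldb of each run.
        CrawlDbReader.main(new String[] { "crawl_depth3/crawldb", "-stats" });
        CrawlDbReader.main(new String[] { "crawl_loop3/crawldb", "-stats" });
    }
}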

As far as I have understood till now, Nutch triggers the crawl in a loop as
many times as the depth value (roughly the generate/fetch/parse/updatedb
cycle sketched below). Please suggest.
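
To make my understanding concrete, here is a minimal sketch of driving one
such depth iteration by hand. The class names are from Nutch 1.4, but the
argument strings, the placeholder paths, and the way the newest segment is
picked up are my assumptions; this is only an illustration, not exactly what
Crawl does internally.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class SingleDepthIteration {
    public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        String crawlDb = "crawl/crawldb";       // placeholder path
        String segmentsDir = "crawl/segments";  // placeholder path

        // Generate a new fetch list (one segment) from the current crawldb.
        ToolRunner.run(conf, new Generator(),
                new String[] { crawlDb, segmentsDir, "-topN", "1000" });

        // The generator writes a timestamp-named directory under segmentsDir;
        // pick the newest one (assumes nothing else writes to this directory).
        FileSystem fs = FileSystem.get(conf);
        FileStatus[] segs = fs.listStatus(new Path(segmentsDir));
        Path segment = segs[0].getPath();
        for (FileStatus s : segs) {
            if (s.getPath().getName().compareTo(segment.getName()) > 0) {
                segment = s.getPath();
            }
        }

        // Fetch and parse the segment, then fold it back into the crawldb so
        // that the next iteration can generate the newly discovered URLs.
        ToolRunner.run(conf, new Fetcher(),
                new String[] { segment.toString(), "-threads", "10" });
        ToolRunner.run(conf, new ParseSegment(),
                new String[] { segment.toString() });
        ToolRunner.run(conf, new CrawlDb(),
                new String[] { crawlDb, segment.toString() });
    }
}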

Also, please let me know why the difference is so large when I do this on
Hadoop using ToolRunner vs. doing the same on standalone Nutch.

Thanks in advance.


Regards:

Ashish V
