Hi, I'm running Nutch trunk as of today. I have 3 slaves and a master. I'm using *mapred.map.tasks=20* and *mapred.reduce.tasks=4*. There is something I'm really confused about.

When I inject 25000 urls, fetch them (depth = 1) and do a readdb -stats, I get:

060110 171347 Statistics for CrawlDb: crawldb
060110 171347 TOTAL urls: 27939
060110 171347 avg score: 1.011
060110 171347 max score: 8.883
060110 171347 min score: 1.0
060110 171347 retry 0: 26429
060110 171347 retry 1: 1510
060110 171347 status 1 (DB_unfetched): 24248
060110 171347 status 2 (DB_fetched): 3390
060110 171347 status 3 (DB_gone): 301
060110 171347 CrawlDb statistics: done

There are several things that don't make sense to me, and it would be great if someone could clear them up:

1. If I count the occurrences of "fetching" in all of my slaves' tasktracker logs, I get 6225. This number clearly doesn't match the *DB_fetched* of 3390 from the readdb output. Why is that? What happened to the 6225-3390=2835 missing urls?

2. Why is *TOTAL urls* 27939 if I inject a file with 25000 entries? Why is it not 25000?

3. What is the meaning of *DB_gone* and *DB_unfetched*? I was assuming that if you inject a total of 25k urls of which 5000 are fetchable, you would get something like:

(DB_unfetched): 20000
(DB_fetched): 5000

That's not the case, so I'd like to understand what exactly is going on here.

4. If I redo the exact same inject + crawl with the same 25000 urls (starting from an empty crawldb, of course), but with *mapred.map.tasks=200* and *mapred.reduce.tasks=8* instead, I get the following readdb output:

060110 162140 TOTAL urls: 33173
060110 162140 avg score: 1.026
060110 162140 max score: 22.083
060110 162140 min score: 1.0
060110 162140 retry 0: 28381
060110 162140 retry 1: 4792
060110 162140 status 1 (DB_unfetched): 23136
060110 162140 status 2 (DB_fetched): 9234
060110 162140 status 3 (DB_gone): 803
060110 162140 CrawlDb statistics: done

How come *DB_fetched* is about 3x higher and *TOTAL urls* goes beyond 25000? It doesn't make any sense. I'd expect to see results similar to the ones I got with the other mapred settings.

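
For reference, here is a quick sanity check on the numbers above, as a minimal Python sketch. The dictionary keys and the helper name summarize are just labels I made up for this mail, not anything from Nutch. It only confirms that the three status counts add up to TOTAL urls in both runs, and shows how far each total is from the 25000 urls I injected.

```python
# Counts copied by hand from the two readdb -stats outputs in this mail.
# The keys and the run labels are my own naming, not Nutch identifiers.
INJECTED = 25000

runs = {
    "map=20, reduce=4": {
        "total": 27939, "unfetched": 24248, "fetched": 3390, "gone": 301,
    },
    "map=200, reduce=8": {
        "total": 33173, "unfetched": 23136, "fetched": 9234, "gone": 803,
    },
}


def summarize(stats):
    """Check that the status counts sum to the total and compute the surplus."""
    status_sum = stats["unfetched"] + stats["fetched"] + stats["gone"]
    return {
        "status_sum": status_sum,
        "matches_total": status_sum == stats["total"],
        "extra_urls": stats["total"] - INJECTED,
    }


for label, stats in runs.items():
    print(label, summarize(stats))
```

So the status lines are internally consistent in both runs; what I can't explain is the surplus over 25000 and why it changes with the mapred settings.
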
Thank you, Florent
