Hi,

I'm running Nutch trunk as of today, with 3 slaves and a master. I'm
using *mapred.map.tasks=20* and *mapred.reduce.tasks=4*.
There is something I'm really confused about.

When I inject 25000 URLs, fetch them (depth = 1), and run readdb
-stats, I get:
060110 171347 Statistics for CrawlDb: crawldb
060110 171347 TOTAL urls:       27939
060110 171347 avg score:        1.011
060110 171347 max score:        8.883
060110 171347 min score:        1.0
060110 171347 retry 0:  26429
060110 171347 retry 1:  1510
060110 171347 status 1 (DB_unfetched):  24248
060110 171347 status 2 (DB_fetched):    3390
060110 171347 status 3 (DB_gone):       301
060110 171347 CrawlDb statistics: done
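
For what it's worth, the output is at least internally consistent: the three status counts add up to TOTAL urls, and so do the two retry counts. Here is a minimal Python sketch of that check, assuming the log format shown above. The function name `parse_stats` is my own and is not part of Nutch.

```python
import re

def parse_stats(text):
    """Parse readdb -stats output into (total, status counts, retry counts)."""
    total, status, retry = None, {}, {}
    for line in text.splitlines():
        m = re.search(r"TOTAL urls:\s+(\d+)", line)
        if m:
            total = int(m.group(1))
            continue
        m = re.search(r"status (\d+) \((\w+)\):\s+(\d+)", line)
        if m:
            status[m.group(2)] = int(m.group(3))
            continue
        m = re.search(r"retry (\d+):\s+(\d+)", line)
        if m:
            retry[int(m.group(1))] = int(m.group(2))
    return total, status, retry

# The readdb output pasted above, copied verbatim.
STATS = """\
060110 171347 TOTAL urls:       27939
060110 171347 retry 0:  26429
060110 171347 retry 1:  1510
060110 171347 status 1 (DB_unfetched):  24248
060110 171347 status 2 (DB_fetched):    3390
060110 171347 status 3 (DB_gone):       301
"""

total, status, retry = parse_stats(STATS)
print("status sum:", sum(status.values()))   # 27939
print("retry sum: ", sum(retry.values()))    # 27939
print("beyond the 25000 injected:", total - 25000)  # 2939
```

So whatever the extra entries are, each of them is counted in exactly one status and one retry bucket.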

There are several things that don't make sense to me and it would be
great if someone could clear this up:

1.
If I count the occurrences of "fetching" across all of my slaves'
tasktracker logs, I get 6225.
This number clearly doesn't match the *DB_fetched* count of 3390 from the
readdb output. Why is that?
What happened to the missing 6225 - 3390 = 2835 URLs?
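
One thing I can still check is whether the 6225 figure counts some URLs more than once (for example, if a URL is attempted again). Below is a minimal Python sketch that counts both the total "fetching" lines and the distinct URLs. It assumes each such log line has the form "fetching <url>", which is my reading of the logs, not something I have confirmed; `count_fetching` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: one log line per fetch attempt, containing "fetching <url>".
FETCHING = re.compile(r"fetching\s+(\S+)")

def count_fetching(lines):
    """Return (total 'fetching' lines, distinct URLs) for an iterable of log lines."""
    attempts = Counter()
    for line in lines:
        match = FETCHING.search(line)
        if match:
            attempts[match.group(1)] += 1
    return sum(attempts.values()), len(attempts)

# Tiny made-up sample: three attempts, two distinct URLs.
sample = [
    "060110 171200 fetching http://a.example/",
    "060110 171201 fetching http://b.example/",
    "060110 171202 fetching http://a.example/",
    "060110 171203 some unrelated line",
]
print(count_fetching(sample))  # (3, 2)
```

To run it on real logs, pass the lines of every slave's tasktracker log to `count_fetching`. If the distinct count is well below 6225, duplicates explain part of the gap.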

2.
Why is *TOTAL urls* 27939 if I inject a file with 25000 entries?
Why is it not 25000?

3.
What do *DB_gone* and *DB_unfetched* mean?
I was assuming that if you inject a total of 25k URLs, of which 5000 are
fetchable, you would get something like:
(DB_unfetched):  20000
(DB_fetched):    5000
That's not the case, so I'd like to understand what exactly is going on here.

4.
If I redo (starting from an empty crawldb of course) the exact same
inject + crawl with the same 25000 urls, but I use the following mapred
settings instead: *mapred.map.tasks=200* and *mapred.reduce.tasks=8*, I
get the following readdb output:
060110 162140 TOTAL urls:       33173
060110 162140 avg score:        1.026
060110 162140 max score:        22.083
060110 162140 min score:        1.0
060110 162140 retry 0:  28381
060110 162140 retry 1:  4792
060110 162140 status 1 (DB_unfetched):  23136
060110 162140 status 2 (DB_fetched):    9234
060110 162140 status 3 (DB_gone):       803
060110 162140 CrawlDb statistics: done
How come *DB_fetched* is almost 3x higher and *TOTAL urls* goes
beyond 25000?
It doesn't make any sense. I'd expect to see results similar to the
previous run with the other mapred settings.

Thank you,
Florent
