Florent Gluck wrote:
> When I inject 25000 urls and fetch them (depth = 1) and do a readdb
> -stats, I get:
>
> 060110 171347 Statistics for CrawlDb: crawldb
> 060110 171347 TOTAL urls: 27939
> 060110 171347 avg score: 1.011
> 060110 171347 max score: 8.883
> 060110 171347 min score: 1.0
> 060110 171347 retry 0: 26429
> 060110 171347 retry 1: 1510
> 060110 171347 status 1 (DB_unfetched): 24248
> 060110 171347 status 2 (DB_fetched): 3390
> 060110 171347 status 3 (DB_gone): 301
> 060110 171347 CrawlDb statistics: done
>
> There are several things that don't make sense to me and it would be
> great if someone could clear this up:
>
> 1.
> If I count the occurrences of "fetching" in all of my slaves'
> tasktracker logs, I get 6225.
> This number clearly doesn't match the DB_fetched count of 3390 from
> the readdb output. Why is that?
> What happened to the 6225-3390=2835 missing urls?
How many errors are you seeing while fetching? Are you getting, e.g.,
lots of timeouts or "max delays exceeded"?
You might also try using protocol-http rather than protocol-httpclient.
Others have reported under-fetching issues with protocol-httpclient.
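If you want numbers rather than impressions, a rough way to tally
attempts against failures is to grep the tasktracker logs on each
slave. This is only a sketch: "fetching" is the string you already
counted, but the log paths and the exact wording of the error
messages vary between Nutch versions, so adjust the patterns to
whatever your logs actually contain:

    # total counts across all tasktracker logs (paths are an example)
    cat logs/*tasktracker* | grep -c "fetching"
    cat logs/*tasktracker* | grep -ci "failed"
    cat logs/*tasktracker* | grep -ci "max delays"

To try protocol-http instead of protocol-httpclient, override
plugin.includes in conf/nutch-site.xml so that it names protocol-http.
The other plugins listed below are only an example; keep whatever set
you already use:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)</value>
    </property>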
> 2.
> Why is TOTAL urls 27939 if I inject a file with 25000 entries?
> Why is it not 25000?
When the crawl db is updated, it adds the pages linked to by fetched
pages, with status DB_unfetched.
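In command terms, that happens at the updatedb step. A sketch of one
depth-1 round on the mapred branch (the segment name is a timestamp,
so the path below is only illustrative):

    bin/nutch inject crawldb urls                      # your 25000 urls
    bin/nutch generate crawldb segments                # build a fetch list
    bin/nutch fetch segments/20060110171347            # fetch it
    bin/nutch updatedb crawldb segments/20060110171347 # merge results + outlinks
    bin/nutch readdb crawldb -stats

The outlinks merged in by updatedb are why TOTAL is 27939: the extra
27939-25000=2939 entries are urls discovered in the pages you fetched.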
> 3.
> What is the meaning of DB_gone and DB_unfetched?
> I was assuming that if you inject a total of 25k urls of which 5000
> are fetchable, you would get something like:
>
> (DB_unfetched): 20000
> (DB_fetched): 5000
>
> That's not the case, so I'd like to understand what exactly is going
> on here. Also, what is the meaning of DB_gone?
DB_gone means that a 404 or some other presumably permanent error was
encountered. This status prevents future attempts to fetch a url.
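If you want to see which urls ended up in which state, you can dump
the crawl db and search it by status name. A sketch, assuming your
CrawlDbReader build supports -dump in addition to -stats and prints
the status name in each record, the way -stats does:

    bin/nutch readdb crawldb -dump crawldb_dump
    cat crawldb_dump/part-* | grep -c "DB_gone"   # should match the -stats count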
> 4.
> If I redo the exact same inject + crawl with the same 25000 urls
> (starting from an empty crawldb, of course), but use the following
> mapred settings instead: mapred.map.tasks=200 and
> mapred.reduce.tasks=8, I get the following readdb output:
>
> 060110 162140 TOTAL urls: 33173
> 060110 162140 avg score: 1.026
> 060110 162140 max score: 22.083
> 060110 162140 min score: 1.0
> 060110 162140 retry 0: 28381
> 060110 162140 retry 1: 4792
> 060110 162140 status 1 (DB_unfetched): 23136
> 060110 162140 status 2 (DB_fetched): 9234
> 060110 162140 status 3 (DB_gone): 803
> 060110 162140 CrawlDb statistics: done
>
> How come DB_fetched is about 3x higher than before, and TOTAL urls
> goes well beyond the earlier 27939?
> It doesn't make any sense to me. I'd expect results similar to those
> with the previous mapred settings.
Please look at your fetcher errors. More, smaller fetch lists mean
that each fetcher task has fewer unique hosts. I'd actually expect
fewer pages to succeed, but only an analysis of your fetcher errors
will fully explain this.
Again, the total is higher because it includes newly discovered urls.
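Concretely, the number of map tasks controls how many pieces the
generated fetch list is split into, so with mapred.map.tasks=200 each
fetcher task sees only a handful of hosts and spends a different share
of its time in per-host politeness delays. If you want to experiment,
the relevant knobs live in conf/nutch-site.xml; the values below are
only illustrative:

    <property>
      <name>mapred.map.tasks</name>
      <value>200</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>8</value>
    </property>
    <property>
      <!-- seconds to wait between requests to the same host -->
      <name>fetcher.server.delay</name>
      <value>5.0</value>
    </property>
    <property>
      <!-- how many times a thread waits on a busy host before
           giving up with a "max delays" error -->
      <name>http.max.delays</name>
      <value>3</value>
    </property>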
Doug