Thanks for your answers, Doug.  Things make more sense now.
I'm still puzzled about why the number of DB_fetched changes so much
when using different values for the map/reduce task settings.
I'm going to inspect the logs and see if I can track down what's going on.
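
For what it's worth, here is roughly the kind of script I intend to run over
the collected tasktracker logs.  The log directory and the error strings are
just placeholders/guesses on my part, not messages I've verified against
Nutch, so they'll probably need adjusting:

#!/usr/bin/env python
# Rough sketch (not part of Nutch): count "fetching" lines and a few
# likely error strings across all tasktracker logs under a directory.
# Both the default directory and the error strings are placeholders.
import os
import sys

log_dir = sys.argv[1] if len(sys.argv) > 1 else "./logs"  # placeholder path
patterns = ["fetching", "max. delays", "timed out", "failed"]
counts = dict.fromkeys(patterns, 0)

for root, dirs, files in os.walk(log_dir):
    for name in files:
        with open(os.path.join(root, name), errors="ignore") as f:
            for line in f:
                for p in patterns:
                    if p in line:
                        counts[p] += 1

for p in patterns:
    print(p, counts[p])
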
Also, I tried to use protocol-http rather than protocol-httpclient, but
it didn't make any difference.
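
For the record, the switch was just a matter of replacing protocol-httpclient
with protocol-http in the plugin.includes property of conf/nutch-site.xml.
The rest of the plugin list below is only an example of what mine looks like;
yours may well differ:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>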

Thanks,
Florent

Doug Cutting wrote:

> Florent Gluck wrote:
>
>> When I inject 25000 urls and fetch them (depth = 1) and do a readdb
>> -stats, I get:
>> 060110 171347 Statistics for CrawlDb: crawldb
>> 060110 171347 TOTAL urls:       27939
>> 060110 171347 avg score:        1.011
>> 060110 171347 max score:        8.883
>> 060110 171347 min score:        1.0
>> 060110 171347 retry 0:  26429
>> 060110 171347 retry 1:  1510
>> 060110 171347 status 1 (DB_unfetched):  24248
>> 060110 171347 status 2 (DB_fetched):    3390
>> 060110 171347 status 3 (DB_gone):       301
>> 060110 171347 CrawlDb statistics: done
>>
>> There are several things that don't make sense to me and it would be
>> great if someone could clear this up:
>>
>> 1.
>> If I count the number of occurrences of "fetching" in all of my slaves'
>> tasktracker logs, I get: 6225
>> This number clearly doesn't match the DB_fetched of 3390 from the
>> readdb output.  Why is that?
>> What happened to the 6225-3390=2835 missing urls?
>
>
> How many errors are you seeing while fetching?  Are you getting, e.g.,
> lots of timeouts or "max delays exceeded"?
>
> You might also try using protocol-http rather than
> protocol-httpclient.  Others have reported under-fetching issues with
> protocol-httpclient.
>
>> 2.
>> Why is the TOTAL urls: 27939 if I inject a file with 25000 entries?
>> Why is it not 25000?
>
>
> When the crawl db is updated, it adds pages linked to by fetched pages,
> with status DB_unfetched.
>
>> 3.
>> What is the meaning of DB_gone and DB_unfetched?
>> I was assuming that if you inject a total of 25k urls, of which 5000
>> are fetchable, you would get something like:
>> (DB_unfetched):  20000
>> (DB_fetched):    5000
>> That's not the case, so I'd like to understand exactly what's going on
>> here.
>> Also, what is the meaning of DB_gone?
>
>
> DB_gone means that a 404 or some other presumably permanent error was
> encountered.  This status prevents future attempts to fetch a url.
>
>> 4.
>> If I redo (starting from an empty crawldb of course) the exact same
>> inject + crawl with the same 25000 urls, but I use the following mapred
>> settings instead: mapred.map.tasks=200 and mapred.reduce.tasks=8, I
>> get the following readdb output:
>> 060110 162140 TOTAL urls:       33173
>> 060110 162140 avg score:        1.026
>> 060110 162140 max score:        22.083
>> 060110 162140 min score:        1.0
>> 060110 162140 retry 0:  28381
>> 060110 162140 retry 1:  4792
>> 060110 162140 status 1 (DB_unfetched):  23136
>> 060110 162140 status 2 (DB_fetched):    9234
>> 060110 162140 status 3 (DB_gone):       803
>> 060110 162140 CrawlDb statistics: done
>> How come DB_fetched is about 3x higher than before, and the TOTAL urls
>> goes way beyond the earlier 27939?
>> It doesn't make any sense to me.  I'd expect results similar to the run
>> with the other mapred settings.
>
>
> Please look at your fetcher errors.  More, smaller fetch lists mean
> that each fetcher task has fewer unique hosts.  I'd actually expect
> fewer pages to succeed, but only an analysis of your fetcher errors
> will fully explain this.
>
> Again, the reason that the total is higher is that it includes new
> urls discovered.
>
> Doug
>
