Thanks for your answers, Doug; things make more sense now.
I'm still puzzled about why the DB_fetched count changes so much when
using different values for the map/reduce task settings.
I'm going to inspect the logs and see if I can track down what's going on.
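
Concretely, I plan to tally attempts vs. errors with something like
this (the log glob is a guess for my layout, and I'm assuming error
lines contain "failed"; "fetching" is logged once per attempted url):

  # total pages the fetcher tasks attempted, summed over all slave logs
  cat logs/*tasktracker* | grep -c "fetching"

  # fetch errors (timeouts, exceeded delays, ...), same caveats
  cat logs/*tasktracker* | grep -ci "failed"
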
Also, I tried to use protocol-http rather than protocol-httpclient, but
it didn't make any difference.
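
For reference, the switch amounts to changing the protocol entry of
plugin.includes, something like the following in nutch-site.xml (the
rest of the value mirrors the default plugin list, so adjust to match
your installed plugins):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>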

Thanks,
Florent

Doug Cutting wrote:

> Florent Gluck wrote:
>
>> When I inject 25000 urls and fetch them (depth = 1) and do a readdb
>> -stats, I get:
>> 060110 171347 Statistics for CrawlDb: crawldb
>> 060110 171347 TOTAL urls:       27939
>> 060110 171347 avg score:        1.011
>> 060110 171347 max score:        8.883
>> 060110 171347 min score:        1.0
>> 060110 171347 retry 0:  26429
>> 060110 171347 retry 1:  1510
>> 060110 171347 status 1 (DB_unfetched):  24248
>> 060110 171347 status 2 (DB_fetched):    3390
>> 060110 171347 status 3 (DB_gone):       301
>> 060110 171347 CrawlDb statistics: done
>>
>> There are several things that don't make sense to me and it would be
>> great if someone could clear this up:
>>
>> 1.
>> If I count the number of occurrences of "fetching" in all of my
>> slaves' tasktracker logs, I get 6225.
>> This number clearly doesn't match the DB_fetched of 3390 from the
>> readdb output.  Why is that?
>> What happened to the 6225-3390=2835 missing urls?
>
>
> How many errors are you seeing while fetching?  Are you getting, e.g.,
> lots of timeouts or "max delays exceeded"?
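>
> If so, knobs like http.timeout and http.max.delays are worth a look.
> Raising http.max.delays makes a fetcher thread wait more times for a
> busy host before giving up on a page.  A sketch (100 is just an
> illustrative value, not a recommendation):
>
>   <property>
>     <name>http.max.delays</name>
>     <value>100</value>
>   </property>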
>
> You might also try using protocol-http rather than
> protocol-httpclient.  Others have reported under-fetching issues with
> protocol-httpclient.
>
>> 2.
>> Why is TOTAL urls 27939 if I inject a file with 25000 entries?
>> Why is it not 25000?
>
>
> When the crawl db is updated it adds pages linked to by fetched pages,
> with status DB_unfetched.
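>
> Schematically, one round of the cycle looks like this (0.8-style
> commands; exact names may vary with your checkout, and <segment> is
> a placeholder for the generated segment directory):
>
>   bin/nutch inject crawldb urls            # your 25000 seeds
>   bin/nutch generate crawldb segments
>   bin/nutch fetch <segment>
>   bin/nutch updatedb crawldb <segment>     # outlinks of fetched pages
>                                            # are added as DB_unfetched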
>
>> 3.
>> What exactly do DB_gone and DB_unfetched mean?
>> I was assuming that if you inject a total of 25k urls of which 5000
>> are fetchable, you would get something like:
>> (DB_unfetched):  20000
>> (DB_fetched):    5000
>> That's not the case, so I'd like to understand what exactly is going
>> on here.  In particular, what does DB_gone mean?
>
>
> DB_gone means that a 404 or some other presumably permanent error was
> encountered.  This status prevents future attempts to fetch a url.
>
>> 4.
>> If I redo the exact same inject + crawl with the same 25000 urls
>> (starting from an empty crawldb, of course), but with
>> mapred.map.tasks=200 and mapred.reduce.tasks=8 instead, I get the
>> following readdb output:
>> 060110 162140 TOTAL urls:       33173
>> 060110 162140 avg score:        1.026
>> 060110 162140 max score:        22.083
>> 060110 162140 min score:        1.0
>> 060110 162140 retry 0:  28381
>> 060110 162140 retry 1:  4792
>> 060110 162140 status 1 (DB_unfetched):  23136
>> 060110 162140 status 2 (DB_fetched):    9234
>> 060110 162140 status 3 (DB_gone):       803
>> 060110 162140 CrawlDb statistics: done
>> How come DB_fetched is about 3x higher than before, and TOTAL urls
>> goes way beyond the earlier 27939?
>> It doesn't make sense to me; I'd expect results similar to the run
>> with the other mapred settings.
>
>
> Please look at your fetcher errors.  More, smaller fetch lists mean
> that each fetcher task has fewer unique hosts.  I'd actually expect
> fewer pages to succeed, but only an analysis of your fetcher errors
> will fully explain this.
>
> Again, the total is higher because it includes newly discovered urls.
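>
> For the record, those settings live in your site config (e.g.,
> nutch-site.xml or mapred-default.xml, depending on the checkout):
>
>   <property>
>     <name>mapred.map.tasks</name>
>     <value>200</value>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>8</value>
>   </property>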
>
> Doug
>


