Andrzej,

I ran 2 crawls of 1 pass each, injecting 100'000 urls.
Here is the output of /readdb -stats/ when crawling with /protocol-http/:

060123 162250 TOTAL urls:       119221
060123 162250 avg score:        1.023
060123 162250 max score:        240.666
060123 162250 min score:        1.0
060123 162250 retry 0:  56648
060123 162250 retry 1:  62573
060123 162250 status 1 (DB_unfetched):  89068
060123 162250 status 2 (DB_fetched):    27513
060123 162250 status 3 (DB_gone):       2640

And here is the output when crawling with /protocol-httpclient/:

060123 180243 TOTAL urls:       117451
060123 180243 avg score:        1.021
060123 180243 max score:        194.0
060123 180243 min score:        1.0
060123 180243 retry 0:  52273
060123 180243 retry 1:  65178
060123 180243 status 1 (DB_unfetched):  89670
060123 180243 status 2 (DB_fetched):    26066
060123 180243 status 3 (DB_gone):       1715

Both return more or less the same results (the number of fetched pages
differs by ~1.5% of the 100k injected urls, which is not surprising on a
set of this size).
I checked the logs and in both cases I see exactly 100'000 fetch attempts.
You were right, it actually makes sense that the settings in
/mapred-default.xml/ would affect the local crawl as well since they
have nothing to do w/ ndfs.
It therefore seems that /protocol-httpclient/ is reliable enough to be
used (well, at least in my case).
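
For anyone who wants to double-check the ~1.5% figure, here is a small sketch of the arithmetic. The numbers are copied from the two /readdb -stats/ dumps above; the helper name and the choice of expressing the gap relative to the 100'000 fetch attempts are just my own assumptions for illustration, not anything Nutch provides.

```python
# Figures copied from the two `readdb -stats` dumps above.
INJECTED = 100_000  # fetch attempts seen in the logs for each crawl

stats = {
    "protocol-http": {
        "fetched": 27513, "gone": 2640, "unfetched": 89068, "total": 119221,
    },
    "protocol-httpclient": {
        "fetched": 26066, "gone": 1715, "unfetched": 89670, "total": 117451,
    },
}


def fetched_gap(a, b, attempts=INJECTED):
    """Hypothetical helper: gap in fetched pages as (absolute, % of attempts)."""
    diff = abs(a["fetched"] - b["fetched"])
    return diff, 100.0 * diff / attempts


for name, s in stats.items():
    # Sanity check: the three status counts add up to the TOTAL line.
    assert s["fetched"] + s["gone"] + s["unfetched"] == s["total"], name

diff, pct = fetched_gap(stats["protocol-http"], stats["protocol-httpclient"])
print(f"fetched gap: {diff} urls = {pct:.2f}% of {INJECTED} attempts")
```

Note that relative to the number of pages actually fetched (rather than attempted) the gap would be larger, so the ~1.5% only holds with the 100k attempts as the baseline.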

--Flo

Florent Gluck wrote:

>Andrzej Bialecki wrote:
>
>>Could you please check (on a smaller sample ;-) ) which of these two
>>changes was necessary? First, second, or both? I suspect only the
>>second change was really needed, i.e. the change in config files, and
>>not the change of protocol-httpclient -> protocol-http ... It would be
>>very helpful if you could confirm/deny this.
>>
>Well, I'm pretty sure protocol-httpclient is part of the problem.
>Early last week, I was trying to figure out what the problem was and I
>ran some crawls on a single machine, using the local filesystem.  Here
>were my previous observations (from an older message):
>
>I injected 50000 urls and got 2315 urls fetched.  I couldn't find a
>trace in the logs of most of the urls.
>I noticed that if I put a counter at the beginning of the
>"/while(true)/" loop in the method /run/ in /Fetcher.java/, I don't
>end up with 50000!
>After some poking around, I noticed that if I comment out the line doing
>the page fetch "/ProtocolOutput output = protocol.getProtocolOutput(key,
>datum);/", then I get 50000.
>There seems to be something really wrong with that.  It seems to mean
>that some threads are dying without notification in the http protocol
>code (if that makes any sense).
>I then decided to switch to using the old http protocol plugin:
>protocol-http (in nutch-default.xml) instead of protocol-httpclient.
>With the old protocol I got 50000 as expected.
>
>
>So to me it seems protocol-httpclient is buggy.  I'll still run a test
>with my current config and protocol-httpclient and let you know.
>-Flo
>
