Andrzej, I ran 2 crawls of 1 pass each, injecting 100'000 urls. Here is the output of /readdb -stats/ when crawling with /protocol-http/:
060123 162250 TOTAL urls: 119221
060123 162250 avg score: 1.023
060123 162250 max score: 240.666
060123 162250 min score: 1.0
060123 162250 retry 0: 56648
060123 162250 retry 1: 62573
060123 162250 status 1 (DB_unfetched): 89068
060123 162250 status 2 (DB_fetched): 27513
060123 162250 status 3 (DB_gone): 2640

And here is the output when crawling with /protocol-httpclient/:

060123 180243 TOTAL urls: 117451
060123 180243 avg score: 1.021
060123 180243 max score: 194.0
060123 180243 min score: 1.0
060123 180243 retry 0: 52273
060123 180243 retry 1: 65178
060123 180243 status 1 (DB_unfetched): 89670
060123 180243 status 2 (DB_fetched): 26066
060123 180243 status 3 (DB_gone): 1715

Both return more or less the same results: the fetched counts differ by 1447 pages (27513 vs. 26066), i.e. ~1.5% of the 100k set, which is not surprising. I checked the logs and in both cases I see exactly 100'000 fetch attempts.

You were right, it actually makes sense that the settings in /mapred-default.xml/ would affect the local crawl as well, since they have nothing to do w/ ndfs.

It therefore seems that /protocol-httpclient/ is reliable enough to be used (well, at least in my case).

--Flo

Florent Gluck wrote:

>Andrzej Bialecki wrote:
>
>>Could you please check (on a smaller sample ;-) ) which of these two
>>changes was necessary? First, second, or both? I suspect only the
>>second change was really needed, i.e. the change in config files, and
>>not the change of protocol-httpclient -> protocol-http ... It would be
>>very helpful if you could confirm/deny this.
>
>Well, I'm pretty sure protocol-httpclient is part of the problem.
>Earlier last week, I was trying to figure out what the problem was and I
>ran some crawls on a single machine, using the local filesystem. Here
>were my previous observations (from an older message):
>
>I injected 50000 urls and got 2315 urls fetched. I couldn't find a
>trace in the logs of most of the urls.
>I noticed that if I put a counter at the beginning of the
>"/while(true)/" loop in the method /run/ in /Fetcher.java/, I don't
>end up with 50000!
>After some poking around, I noticed that if I comment out the line doing
>the page fetch "/ProtocolOutput output = protocol.getProtocolOutput(key,
>datum);/", then I get 50000.
>There seems to be something really wrong with that. It seems to mean
>that some threads are dying without notification in the http protocol
>code (if it makes any sense).
>I then decided to switch to using the old http protocol plugin:
>protocol-http (in nutch-default.xml) instead of protocol-httpclient.
>With the old protocol I got 50000 as expected.
>
>
>So to me it seems protocol-httpclient is buggy. I'll still run a test
>with my current config and protocol-httpclient and let you know.
>-Flo
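
PS: For anyone who wants to repeat the counter experiment described in the quoted message, here is a minimal standalone sketch of the idea. It is not the Nutch /Fetcher/ code: the class /FetchLoopCounter/, its /run/ method and the /fetch/ callback are hypothetical names made up for illustration. The point is only that counting at the top of the loop and catching Throwable around the fetch call makes a failing fetch visible instead of letting the worker thread die silently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical illustration only; this is NOT the Nutch Fetcher.
public class FetchLoopCounter {

    /** Counters collected over all worker threads. */
    public static final class Counts {
        public final int attempts;  // loop iterations that picked up a URL
        public final int failures;  // fetch calls that threw
        Counts(int attempts, int failures) {
            this.attempts = attempts;
            this.failures = failures;
        }
    }

    /** Runs 'fetch' on every URL using 'threadCount' worker threads. */
    public static Counts run(List<String> urls, int threadCount, Consumer<String> fetch)
            throws InterruptedException {
        Queue<String> queue = new ConcurrentLinkedQueue<>(urls);
        AtomicInteger attempts = new AtomicInteger();
        AtomicInteger failures = new AtomicInteger();
        Thread[] workers = new Thread[threadCount];

        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                while (true) {
                    String url = queue.poll();
                    if (url == null) break;        // queue drained, thread exits normally
                    attempts.incrementAndGet();    // counter at the top of the loop
                    try {
                        fetch.accept(url);         // stand-in for the protocol call
                    } catch (Throwable t) {
                        // Without this catch the thread would die here and the
                        // remaining URLs it would have handled could go missing.
                        failures.incrementAndGet();
                        System.err.println("fetch of " + url + " failed: " + t);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        return new Counts(attempts.get(), failures.get());
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 1000; i++) urls.add("http://example.com/" + i);

        // Simulated fetch: every URL ending in '7' throws.
        Counts c = run(urls, 8, url -> {
            if (url.endsWith("7")) throw new IllegalStateException("simulated failure");
        });
        System.out.println("attempts=" + c.attempts + " failures=" + c.failures);
    }
}
```

With the simulated fetch above, the attempt count should equal the number of injected URLs (1000) and 100 failures should be logged. If the attempt count comes up short in a real crawl, that points at threads exiting somewhere other than the normal end of the loop.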