Andrzej Bialecki wrote:
> Florent Gluck wrote:
>
>> Hi Mike,
>>
>> I finally got everything working properly!
>> What I did was to switch to /protocol-http/ and move the following from
>> /nutch-site.xml/ to /mapred-default.xml/:
>>
>
>
> Could you please check (on a smaller sample ;-) ) which of these two
> changes was necessary? First, second, or both? I suspect only the
> second change was really needed, i.e. the change in config files, and
> not the change of protocol-httpclient -> protocol-http ... It would be
> very helpful if you could confirm/deny this.
>

Well, I'm pretty sure protocol-httpclient is part of the problem.
Earlier last week, I was trying to figure out what the problem was, and I
ran some crawls on a single machine, using the local filesystem.
Here were my previous observations (from an older message):

I injected 50000 urls and got 2315 urls fetched. For most of the urls I
couldn't find any trace in the logs.
I noticed that if I put a counter at the beginning of the /while(true)/
loop in the method /run/ in /Fetcher.java/, I don't end up with 50000!
After some poking around, I noticed that if I comment out the line doing
the page fetch
"/ProtocolOutput output = protocol.getProtocolOutput(key, datum);/",
then I get 50000.
There seems to be something really wrong with that. It seems to mean that
some threads are dying without notification in the http protocol code
(if that makes any sense).
I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient.
With the old protocol I got 50000 as expected.

So to me it seems protocol-httpclient is buggy.
I'll still run a test with my current config and protocol-httpclient and
let you know.

-Flo
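
For anyone trying to reproduce the symptom, here is a minimal standalone
sketch of how a worker loop can lose iterations when a call inside it
throws something nobody catches. This is not the Fetcher.java code:
SilentThreadDeathDemo, fakeFetch and run are made-up names, and the
"fails on every tenth item" rule is purely an assumption for the demo.
It only shows the counter-at-the-top-of-the-loop technique and the
effect of guarding the call with a catch.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: none of these names come from Nutch.
final class SilentThreadDeathDemo {

    // Hypothetical stand-in for a protocol plugin call that fails
    // with an unchecked exception on some inputs.
    static String fakeFetch(int id) {
        if (id % 10 == 3) {
            throw new IllegalStateException("simulated failure for item " + id);
        }
        return "content-" + id;
    }

    // Returns {loop iterations started, worker threads killed by an uncaught exception}.
    static int[] run(int total, int threads, boolean guard) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < total; i++) {
            queue.add(i);
        }
        AtomicInteger started = new AtomicInteger();
        AtomicInteger died = new AtomicInteger();

        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                while (true) {
                    Integer id = queue.poll();
                    if (id == null) {
                        break; // queue drained, normal exit
                    }
                    started.incrementAndGet(); // the counter at the top of the loop
                    if (guard) {
                        try {
                            fakeFetch(id);
                        } catch (Throwable e) {
                            // a real fetcher would log this; the thread keeps going
                        }
                    } else {
                        fakeFetch(id); // an exception here ends this thread
                    }
                }
            });
            // Count thread deaths instead of relying on the default stderr stack trace.
            workers[t].setUncaughtExceptionHandler((thread, error) -> died.incrementAndGet());
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return new int[] { started.get(), died.get() };
    }

    public static void main(String[] args) {
        int[] unguarded = run(1000, 4, false);
        int[] guarded = run(1000, 4, true);
        System.out.println("unguarded: started=" + unguarded[0] + " died=" + unguarded[1]);
        System.out.println("guarded:   started=" + guarded[0] + " died=" + guarded[1]);
    }
}
```

In the unguarded run every worker eventually hits a failing item and
exits, so the counter stops well short of the number of queued items. In
the guarded run the counter reaches the full total. Whether this is what
actually happens inside protocol-httpclient is not established here; the
sketch only shows that the observed counts are consistent with threads
dying on an uncaught exception.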
