Andrzej Bialecki wrote:

> Florent Gluck wrote:
>
>> Hi Mike,
>>
>> I finally got everything working properly!
>> What I did was to switch to /protocol-http/ and move the following from
>> /nutch-site.xml/ to /mapred-default.xml/:
>>   
>
>
> Could you please check (on a smaller sample ;-) ) which of these two
> changes was necessary? First, second, or both? I suspect only the
> second change was really needed, i.e. the change in config files, and
> not the change of protocol-httpclient -> protocol-http ... It would be
> very helpful if you could confirm/deny this.
>
Well, I'm pretty sure protocol-httpclient is part of the problem.
Early last week, I was trying to figure out what the problem was and I
ran some crawls on a single machine, using the local filesystem.  Here
are my earlier observations (from an older message):

I injected 50000 urls and got 2315 urls fetched.  I couldn't find a
trace in the logs of most of the urls.
I noticed that if I put a counter at the beginning of the
"/while(true)/" loop in the method /run/ in /Fetcher.java/, I don't
end up with 50000!
After some poking around, I noticed that if I comment out the line doing
the page fetch "/ProtocolOutput output = protocol.getProtocolOutput(key,
datum);/", then I get 50000.
There seems to be something really wrong with that.  It seems to mean
that some threads are dying without notification in the http protocol
code (if that makes any sense).
I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient.
With the old protocol I got 50000 as expected.
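
For anyone who wants to repeat the counting experiment, here is a minimal,
standalone sketch of the idea. It is not the Nutch Fetcher code: the class
FetchLoopProbe, the PageFetcher interface and the Counts holder are all
hypothetical names made up for illustration. Each worker counts every pass
through its loop and catches Throwable around the fetch call, so a failure
inside the protocol code gets logged and counted instead of silently ending
the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchLoopProbe {

    /** Hypothetical stand-in for the protocol plugin's fetch call. */
    public interface PageFetcher {
        String fetch(String url) throws Exception;
    }

    /** Totals gathered across all worker threads. */
    public static final class Counts {
        public final int attempted;
        public final int succeeded;
        public final int failed;

        Counts(int attempted, int succeeded, int failed) {
            this.attempted = attempted;
            this.succeeded = succeeded;
            this.failed = failed;
        }
    }

    public static Counts run(List<String> urls, int threads, PageFetcher fetcher) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(urls);
        AtomicInteger attempted = new AtomicInteger();
        AtomicInteger succeeded = new AtomicInteger();
        AtomicInteger failed = new AtomicInteger();

        Runnable worker = () -> {
            while (true) {
                String url = queue.poll();
                if (url == null) {
                    break; // queue drained, this worker is done
                }
                attempted.incrementAndGet(); // counter at the top of the loop
                try {
                    fetcher.fetch(url);
                    succeeded.incrementAndGet();
                } catch (Throwable t) {
                    // Without this catch, an unchecked exception or Error
                    // would end the thread and its remaining URLs would
                    // never show up in the logs.
                    failed.incrementAndGet();
                    System.err.println("fetch of " + url + " failed: " + t);
                }
            }
        };

        Thread[] pool = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            pool[i] = new Thread(worker, "probe-" + i);
            pool[i].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return new Counts(attempted.get(), succeeded.get(), failed.get());
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            urls.add("http://example.invalid/" + i);
        }
        // Simulated plugin that blows up on one URL.
        Counts c = run(urls, 4, url -> {
            if (url.endsWith("3")) {
                throw new IllegalStateException("simulated protocol failure");
            }
            return "ok";
        });
        System.out.println("attempted=" + c.attempted
                + " succeeded=" + c.succeeded + " failed=" + c.failed);
    }
}
```

If the attempted count matches the number of injected URLs with the catch in
place but falls short without it, that would point at exceptions escaping
from the fetch call rather than at the input list.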


So to me it seems protocol-httpclient is buggy.  I'll still run a test
with my current config and protocol-httpclient and let you know.
-Flo
