Brian created NUTCH-1613:
----------------------------

             Summary: Timeouts in protocol-httpclient when crawling same host 
with >2 threads and added cookie strings for both http protocols
                 Key: NUTCH-1613
                 URL: https://issues.apache.org/jira/browse/NUTCH-1613
             Project: Nutch
          Issue Type: Bug
          Components: protocol
    Affects Versions: 2.2.1
            Reporter: Brian
            Priority: Minor


1.)  When using protocol-httpclient to crawl a single website (the same host), 
fetching consistently produced timeout errors, and the pages that hit those 
errors were not fetched. E.g.:

2013-07-09 17:57:13,717 WARN  fetcher.FetcherJob - fetch of http://www.... failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
2013-07-09 17:57:13,718 INFO  fetcher.FetcherJob - fetching http://www.... (queue crawl delay=0ms)
2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following error: 
org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting for connection
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
        at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
        at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
        at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)

This is because, by default, the connection pool manager allows only 2 
connections per host, so if more than 2 threads fetch from the same host the 
extra threads tend to time out waiting to get a connection.  The code 
previously set the overall maximum number of connections correctly, but not the 
maximum number of connections per host.
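
The effect can be reproduced without Nutch. The following is a minimal sketch, 
not the actual patch: it models the per-host limit with a plain 
java.util.concurrent.Semaphore and counts how many worker threads give up 
waiting. The class name, method name and timing values are made up for 
illustration. In commons-httpclient 3.x the real limit is, as far as I know, 
set through HttpConnectionManagerParams.setDefaultMaxConnectionsPerHost(int); 
the sketch does not call that library.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PerHostPoolSketch {

    // Hypothetical helper: returns how many of `threads` workers gave up
    // waiting for one of `maxPerHost` connection slots to the same host.
    public static int countTimeouts(int maxPerHost, int threads,
                                    long holdMillis, long waitMillis)
            throws InterruptedException {
        Semaphore slots = new Semaphore(maxPerHost);   // stands in for the pool
        CountDownLatch start = new CountDownLatch(1);  // release all workers together
        AtomicInteger timeouts = new AtomicInteger();
        Thread[] workers = new Thread[threads];

        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();
                    // Wait a bounded time for a free slot, like a pool timeout.
                    if (slots.tryAcquire(waitMillis, TimeUnit.MILLISECONDS)) {
                        try {
                            Thread.sleep(holdMillis);  // simulated fetch
                        } finally {
                            slots.release();
                        }
                    } else {
                        timeouts.incrementAndGet();    // "Timeout waiting for connection"
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        start.countDown();
        for (Thread w : workers) {
            w.join();
        }
        return timeouts.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Limit of 2 per host with 4 fetcher threads, then limit raised to 4.
        System.out.println("limit 2, 4 threads: "
                + countTimeouts(2, 4, 300, 50) + " timeouts");
        System.out.println("limit 4, 4 threads: "
                + countTimeouts(4, 4, 300, 50) + " timeouts");
    }
}
```

With the per-host limit raised to the number of fetcher threads, no worker in 
the sketch has to wait, which is the behaviour the fix aims for.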


2.) At the same time I also added simple modifications to both protocol-http 
and protocol-httpclient that allow a cookie string to be specified in the conf 
file and included in the request headers.  

I use this to crawl site content requiring authentication - it is better for me 
to specify the cookie string for the authentication than to go through the 
whole authentication process and specify login info.

The nutch-site.xml property is the following:

<property>
        <name>http.cookie_string</name>
        <value>XX_AL=authorization_value_goes_here</value>
        <description>String to use as the cookie value for HTTP requests</description>
</property>


Although I use it for authentication, it can be used to specify any single 
cookie string for the crawl (httpclient does support different cookies for 
different hosts, but I did not get into that).
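
As an illustration of how such a string ends up in a request, here is a minimal 
sketch, again not the actual patch. It assumes the configured value is sent 
unchanged as a single Cookie header and builds a raw request by hand. The 
class and method names and the example host are hypothetical; the cookie value 
is the placeholder from the property above.

```java
public class CookieHeaderSketch {

    // Hypothetical helper: builds a minimal HTTP/1.0 GET request and adds a
    // Cookie header only when a non-empty cookie string was configured.
    public static String buildRequest(String host, String path, String cookieString) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append("\r\n");
        if (cookieString != null && !cookieString.trim().isEmpty()) {
            req.append("Cookie: ").append(cookieString.trim()).append("\r\n");
        }
        req.append("\r\n");  // blank line terminates the header section
        return req.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the value read from http.cookie_string in nutch-site.xml.
        String cookie = "XX_AL=authorization_value_goes_here";
        System.out.print(buildRequest("www.example.com", "/", cookie));
    }
}
```

Leaving the property empty keeps the request as it was before, so crawls that 
do not need the cookie are unaffected in this sketch.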


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
