[ 
https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714701#comment-13714701
 ] 

lufeng commented on NUTCH-1613:
-------------------------------

OK. Will this cookie affect other URLs that don't need any cookie, so that 
they receive a "Bad Request" error when using httpclient? It does not seem 
very general, so could we add a filter that specifies a different cookie 
for each host?
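
To make the per-host idea concrete, here is a minimal sketch of such a filter. It is only an illustration, not part of the attached patch: the class name HostCookieFilter, its methods, and the example hosts are all hypothetical, and wiring it into protocol-http or protocol-httpclient is left open.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: chooses a cookie string per host instead of one global value. */
class HostCookieFilter {
    // Lower-cased host name -> cookie string to send to that host.
    private final Map<String, String> cookiesByHost = new HashMap<>();

    /** Registers the cookie string for a host; host names are matched case-insensitively. */
    void put(String host, String cookie) {
        cookiesByHost.put(host.toLowerCase(Locale.ROOT), cookie);
    }

    /** Returns the cookie for the URL's host, or null when no cookie should be sent. */
    String cookieFor(URI uri) {
        String host = uri.getHost();
        if (host == null) {
            return null;
        }
        return cookiesByHost.get(host.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        HostCookieFilter filter = new HostCookieFilter();
        // Example host and cookie value are placeholders.
        filter.put("secure.example.com", "XX_AL=authorization_value_goes_here");
        System.out.println(filter.cookieFor(URI.create("http://secure.example.com/page")));
        System.out.println(filter.cookieFor(URI.create("http://other.example.org/")));
    }
}
```

With a lookup like this, a request to a host without an entry gets no Cookie header at all, which would avoid sending the authentication cookie to sites that reject it.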
                
> Timeouts in protocol-httpclient when crawling same host with >2 threads and 
> added cookie strings for both http protocols
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1613
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1613
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 2.2.1
>            Reporter: Brian
>            Priority: Minor
>              Labels: patch
>             Fix For: 2.3
>
>         Attachments: NUTCH-1613.patch
>
>
> 1.)  When using protocol-httpclient to crawl a single website (the same host) 
> I would always get a bunch of timeout errors during fetching and the pages 
> with errors would not be fetched. E.g.:
> 2013-07-09 17:57:13,717 WARN  fetcher.FetcherJob - fetch of http://www.... 
> failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: 
> Timeout waiting for connection
> 2013-07-09 17:57:13,718 INFO  fetcher.FetcherJob - fetching http://www.... 
> (queue crawl delay=0ms)
> 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following 
> error: 
> org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting 
> for connection
>       at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497)
>       at 
> org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
>       at 
> org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
>       at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
>       at 
> org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
>       at 
> org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
>       at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174)
>       at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133)
>       at 
> org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518)
> This is because by default the connection pool manager only allows 2 
> connections per host so if more than 2 threads are used the others will tend 
> to time out waiting to get a connection.   The code previously set max 
> connections correctly but not connections per host.
> 2.) I also added at the same time simple modifications to both protocol-http 
> and protocol-httpclient to allow specifying a cookie string in the conf file 
> to include in request headers.  
> I use this to crawl site content requiring authentication - it is better for 
> me to specify the cookie string for the authentication than to go through the 
> whole authentication process and specify login info.
> The nutch-site.xml property is the following:
> <property>
>         <name>http.cookie_string</name>
>         <value>XX_AL=authorization_value_goes_here</value>
>               <description>String to use as the cookie value for HTTP 
> requests</description>
> </property>
> Although I use it for authentication it can be used to specify any single 
> cookie string for the crawl (httpclient does support different cookies for 
> different hosts but I did not get into that).
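
On item 1 of the quoted description, the timeout behaviour can be reproduced with a simplified model. The sketch below is an assumption-laden stand-in, not the commons-httpclient implementation: the hypothetical class PerHostPoolModel uses a plain java.util.concurrent.Semaphore to represent the per-host connection limit, and shows that with a limit of 2 a third request waits and then gives up.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a per-host connection limit (not the real connection manager). */
class PerHostPoolModel {
    private final Semaphore slots;

    PerHostPoolModel(int maxConnectionsPerHost) {
        this.slots = new Semaphore(maxConnectionsPerHost);
    }

    /** Tries to obtain a connection slot; returns false if none frees up within the timeout. */
    boolean acquire(long timeoutMs) {
        try {
            return slots.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns a slot to the pool, as a fetcher thread would after finishing a request. */
    void release() {
        slots.release();
    }

    public static void main(String[] args) {
        PerHostPoolModel pool = new PerHostPoolModel(2); // limit of 2 per host, as described
        System.out.println(pool.acquire(50)); // first fetcher thread gets a slot
        System.out.println(pool.acquire(50)); // second fetcher thread gets a slot
        System.out.println(pool.acquire(50)); // third one times out waiting
    }
}
```

Raising the per-host limit to at least the number of fetcher threads hitting the same host removes the wait in this model, which matches the explanation given in the issue.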

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
