[
https://issues.apache.org/jira/browse/NUTCH-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1342:
---------------------------------
Attachment: NUTCH-1342-1.6-1.patch
Patch for 1.6. This patch changes the behavior when a read timeout occurs.
Currently the SocketTimeoutException is propagated to higher-level code without
checking for edge cases. This patch assumes that if bytes were received and no
Content-Length header was specified, the data read so far is all right.
This change definitely fixes read timeout problems caused by badly configured
servers, but it still relies on the connection timing out.
Please comment!
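
For readers without the attachment at hand, the rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in NUTCH-1342-1.6-1.patch: the class LenientBodyReader, the method readBody and the convention that contentLength is -1 when no Content-Length header was sent are all hypothetical names chosen for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

// Hypothetical helper for illustration only; not taken from the Nutch code base.
final class LenientBodyReader {

  /**
   * Reads a response body from the given stream.
   *
   * @param in            stream positioned at the start of the body
   * @param contentLength declared Content-Length, or -1 if the header was absent
   *                      (this convention is an assumption of the sketch)
   */
  static byte[] readBody(InputStream in, int contentLength) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    try {
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
        // Stop early once the declared length has been read.
        if (contentLength >= 0 && out.size() >= contentLength) {
          break;
        }
      }
    } catch (SocketTimeoutException e) {
      // Edge case from the description: bytes were received and the server
      // declared no Content-Length, so treat what was read as the full body.
      boolean noDeclaredLength = contentLength < 0;
      boolean receivedSomething = out.size() > 0;
      if (!(noDeclaredLength && receivedSomething)) {
        throw e; // otherwise keep the current behavior and propagate
      }
    }
    return out.toByteArray();
  }
}
```

As in the description, the sketch still waits for the socket timeout to fire before returning, so a badly configured server costs one full timeout per fetch.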
> Read time out protocol-http
> ---------------------------
>
> Key: NUTCH-1342
> URL: https://issues.apache.org/jira/browse/NUTCH-1342
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.4, 1.5
> Reporter: Markus Jelsma
> Priority: Critical
> Fix For: 1.6
>
> Attachments: NUTCH-1342-1.6-1.patch
>
>
> For some reason some URLs always time out with protocol-http but not
> with protocol-httpclient. The stack trace is always the same:
> {code}
> 2012-04-20 11:25:44,275 ERROR http.Http - Failed to get protocol output
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at java.io.FilterInputStream.read(FilterInputStream.java:116)
> at java.io.PushbackInputStream.read(PushbackInputStream.java:169)
> at java.io.FilterInputStream.read(FilterInputStream.java:90)
> at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:228)
> at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:157)
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
> at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
> {code}
> Some example URLs:
> * 404 http://www.fcgroningen.nl/tribunenamen/stemmen/
> * 301 http://shop.fcgroningen.nl/aanbieding