Julien Nioche created NUTCH-1919:
------------------------------------
Summary: Getting timeout when server returns Content-Length: 0
Key: NUTCH-1919
URL: https://issues.apache.org/jira/browse/NUTCH-1919
Project: Nutch
Issue Type: Bug
Components: protocol
Reporter: Julien Nioche
Fix For: 1.10
This has been investigated and fixed in Storm-Crawler
[https://github.com/DigitalPebble/storm-crawler/issues/48].
{quote}
curl -I "http://www.dailynewslosangeles.com/"
HTTP/1.1 301 Moved Permanently
Location: http://www.dailynews.com
Connection: close
Content-Length: 0
Content-Type: text/html; charset=UTF-8
{quote}
When fetching with Nutch, we get a timeout exception:
{quote}
./nutch parsechecker -D http.agent.name="PebbleCrawler"
"http://www.dailynewslosangeles.com/"
fetching: http://www.dailynewslosangeles.com/
Fetch failed with protocol status: exception(16), lastModified=0:
java.net.SocketTimeoutException: Read timed out
{quote}
The reason is that we try to read from the stream even though we know that
the content length is 0, so the read blocks until the socket times out.
The patch attached fixes the issue.
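
For illustration only, here is a minimal sketch of the idea; it is not the attached patch and not Nutch's actual code. The class name ContentLengthGuard and the method readBody are hypothetical, and the sketch assumes the Content-Length header has already been parsed into an int, with -1 meaning the header was absent. When the declared length is 0, the method returns an empty body without touching the stream, so a read that would otherwise block is never attempted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration; not part of Nutch.
class ContentLengthGuard {

    // contentLength: parsed Content-Length value, or -1 if the header was absent (assumption).
    static byte[] readBody(InputStream in, int contentLength) throws IOException {
        if (contentLength == 0) {
            // Nothing to read: return immediately instead of blocking on the socket.
            return new byte[0];
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int total = 0;
        // Read until the declared length is reached, or until EOF when the length is unknown.
        while (contentLength < 0 || total < contentLength) {
            int want = contentLength < 0
                    ? buffer.length
                    : Math.min(buffer.length, contentLength - total);
            int n = in.read(buffer, 0, want);
            if (n == -1) {
                break; // stream ended early
            }
            out.write(buffer, 0, n);
            total += n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a socket stream on which any read attempt times out.
        InputStream stalled = new InputStream() {
            @Override
            public int read() throws IOException {
                throw new SocketTimeoutException("Read timed out");
            }
        };
        // With Content-Length: 0 the stream is never read, so no exception is thrown.
        System.out.println(readBody(stalled, 0).length); // prints 0

        InputStream body = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(readBody(body, 5), StandardCharsets.UTF_8)); // prints hello
    }
}
```

The stalled stream in main stands in for the socket of the redirect response shown above: any read on it fails, yet the zero-length case still returns cleanly.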