Julien Nioche created NUTCH-1919:
------------------------------------

             Summary: Getting timeout when server returns Content-Length: 0 
                 Key: NUTCH-1919
                 URL: https://issues.apache.org/jira/browse/NUTCH-1919
             Project: Nutch
          Issue Type: Bug
          Components: protocol
            Reporter: Julien Nioche
             Fix For: 1.10


This has been investigated and fixed in Storm-Crawler 
[https://github.com/DigitalPebble/storm-crawler/issues/48].

{quote}
curl -I "http://www.dailynewslosangeles.com/"
HTTP/1.1 301 Moved Permanently
Location: http://www.dailynews.com
Connection: close
Content-Length: 0
Content-Type: text/html; charset=UTF-8
{quote}

When fetching with Nutch, we get a timeout exception:

{quote}
./nutch parsechecker -D http.agent.name="PebbleCrawler" "http://www.dailynewslosangeles.com/"
fetching: http://www.dailynewslosangeles.com/
Fetch failed with protocol status: exception(16), lastModified=0: 
java.net.SocketTimeoutException: Read timed out
{quote}

The cause is that we try to read from the response stream even though the 
Content-Length header already tells us the body is empty, so the read waits 
until the socket times out.

The patch attached fixes the issue. 
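
For illustration only, here is a minimal sketch of the kind of guard described above. It is not the code from the attached patch; the class name ContentLengthGuard and the method readBody are hypothetical, and a negative contentLength is assumed to mean that the header was absent.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not the code from the attached patch.
public class ContentLengthGuard {

    /**
     * Reads a response body from the given stream.
     * Assumption: contentLength < 0 means the Content-Length header was absent.
     */
    public static byte[] readBody(InputStream in, long contentLength) throws IOException {
        if (contentLength == 0) {
            // The server declared an empty body, so do not touch the stream at all:
            // a read here could block until the socket timeout fires.
            return new byte[0];
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        // With no declared length, read until end of stream.
        long remaining = contentLength < 0 ? Long.MAX_VALUE : contentLength;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n == -1) {
                break; // end of stream reached before the declared length
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

The point of the sketch is only that the zero-length check happens before any call to read(), so a response like the 301 above would return immediately instead of waiting on the socket.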







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
