[ 
https://issues.apache.org/jira/browse/NUTCH-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095182#comment-13095182
 ] 

Ferdy commented on NUTCH-1039:
------------------------------

This error is definitely caused by the server occasionally returning an 
*empty* contentlength. I know this because we had the same issue with nu.nl 
before, which is actually why I opened issue NUTCH-1096.

To conclude, there are 3 cases:

A) Server returns a valid contentlength: Integer is parsed and all goes well.
B) Server returns no contentlength: No integer is parsed, instead contentlength 
is set to Integer.MAX_VALUE (of course it is still limited by 
http.content.limit). Fetching will continue as normal.
C) Server returns an invalid contentlength, whether it be an empty string or 
just plain garbage. This will result in a NumberFormatException followed by an 
HttpException.
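
A minimal sketch of the three cases above, for illustration only. This is not 
the actual Nutch code: the class name, the method name and the exception type 
are hypothetical stand-ins, and appending the raw header value to the message 
is an assumption made so that an empty value yields a message with nothing 
after the colon.

```java
// Hypothetical illustration of cases A, B and C; not taken from Nutch.
class ContentLengthSketch {

    /** Hypothetical stand-in for org.apache.nutch.protocol.http.api.HttpException. */
    static class BadContentLengthException extends Exception {
        BadContentLengthException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * @param headerValue raw Content-Length value, or null when the header is absent
     * @return the parsed length, or Integer.MAX_VALUE when the header is absent
     */
    static int parseContentLength(String headerValue) throws BadContentLengthException {
        if (headerValue == null) {
            // Case B: no header. The caller is still expected to cap the
            // amount read at http.content.limit.
            return Integer.MAX_VALUE;
        }
        try {
            // Case A: a valid integer parses and fetching goes on.
            return Integer.parseInt(headerValue.trim());
        } catch (NumberFormatException e) {
            // Case C: empty string or garbage. The message format is an
            // assumption; with an empty value nothing follows the colon.
            throw new BadContentLengthException("bad content length: " + headerValue, e);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseContentLength("1024"));                    // case A
        System.out.println(parseContentLength(null) == Integer.MAX_VALUE); // case B
        try {
            parseContentLength("");                                        // case C
        } catch (BadContentLengthException e) {
            System.out.println("[" + e.getMessage() + "]");
        }
    }
}
```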

Your case is C, because *org.apache.nutch.protocol.http.api.HttpException: bad 
content length:* indicates an empty contentlength.

To allow the cases with an empty string to proceed as normal I created the 
patch in NUTCH-1096, so this issue is somewhat of a duplicate of NUTCH-1096. 
I propose to close this issue, however, because its title indicates that it is 
about case B. (And as mentioned above, case B was never a problem in the first 
place.)

> Fetcher fails for pages without content-length header
> -----------------------------------------------------
>
>                 Key: NUTCH-1039
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1039
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>
> Fetcher fails:
> 2011-07-11 14:45:34,764 ERROR http.Http - 
> org.apache.nutch.protocol.http.api.HttpException: bad content length:
> 2011-07-11 14:45:34,765 ERROR http.Http - at 
> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:218)
> 2011-07-11 14:45:34,765 ERROR http.Http - at 
> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:158)
> 2011-07-11 14:45:34,765 ERROR http.Http - at 
> org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
> 2011-07-11 14:45:34,765 ERROR http.Http - at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
> 2011-07-11 14:45:34,765 ERROR http.Http - at 
> org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:79)
> Both fetcher and indexing filter checker fail sometimes. I'm unsure whether 
> this is something in Nutch or whether the remote server only returns 
> content-length incidentally.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
