[
https://issues.apache.org/jira/browse/NUTCH-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095182#comment-13095182
]
Ferdy commented on NUTCH-1039:
------------------------------
This error is definitely caused by the server occasionally returning an
*empty* contentlength. I know this because we had the same problem with nu.nl
before, which is actually why I opened issue NUTCH-1096.
To conclude, there are 3 cases:
A) Server returns a valid contentlength: Integer is parsed and all goes well.
B) Server returns no contentlength: No integer is parsed, instead contentlength
is set to Integer.MAX_VALUE (of course it is still limited by
http.content.limit). Fetching will continue as normal.
C) Server returns an invalid contentlength, whether it be an empty string or
just plain garbage. This will result in a NumberFormatException followed by an
HttpException.
Your case is C, because *org.apache.nutch.protocol.http.api.HttpException: bad
content length:* with nothing after the colon indicates an empty contentlength.
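
To make the three cases concrete, here is a minimal sketch of the parsing logic
as described above. It is not the actual Nutch source: the class name, the
method resolveContentLength and the nested HttpException are hypothetical
stand-ins, and the handling of http.content.limit (negative meaning "no limit")
is an assumption made for illustration.

```java
import java.io.IOException;

public class ContentLengthSketch {

    /** Hypothetical stand-in for org.apache.nutch.protocol.http.api.HttpException. */
    static class HttpException extends IOException {
        HttpException(String message) {
            super(message);
        }
    }

    /**
     * Resolves how many bytes to read for a response.
     *
     * @param header       raw Content-Length value, or null if the header is absent
     * @param contentLimit stand-in for http.content.limit; negative is assumed to mean "no limit"
     */
    static int resolveContentLength(String header, int contentLimit) throws HttpException {
        // Case B: no header at all, so start from "unbounded".
        int contentLength = Integer.MAX_VALUE;

        if (header != null) {
            try {
                // Case A: a valid integer is parsed and used.
                contentLength = Integer.parseInt(header.trim());
            } catch (NumberFormatException e) {
                // Case C: empty string or garbage. For an empty header the
                // message ends right after the colon, as in the reported log.
                throw new HttpException("bad content length: " + header);
            }
        }

        // In every non-failing case the configured limit still applies.
        if (contentLimit >= 0 && contentLength > contentLimit) {
            contentLength = contentLimit;
        }
        return contentLength;
    }

    public static void main(String[] args) {
        String[] samples = {"1234", null, "", "abc"};
        for (String sample : samples) {
            try {
                System.out.println("[" + sample + "] -> " + resolveContentLength(sample, 65536));
            } catch (HttpException e) {
                System.out.println("[" + sample + "] -> " + e.getMessage());
            }
        }
    }
}
```

The point of the sketch is that an absent header (null) and an empty header
("") take different paths: only the latter reaches Integer.parseInt and fails.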
To allow the cases with an empty string to proceed as normal I created the
patch in NUTCH-1096, so this issue is more or less a duplicate of NUTCH-1096.
I nevertheless propose to close this issue, because its title indicates that
it is about case B (which, as mentioned above, was never really an issue in
the first place).
> Fetcher fails for pages without content-length header
> -----------------------------------------------------
>
> Key: NUTCH-1039
> URL: https://issues.apache.org/jira/browse/NUTCH-1039
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> Fetcher fails:
> 2011-07-11 14:45:34,764 ERROR http.Http -
> org.apache.nutch.protocol.http.api.HttpException: bad content length:
> 2011-07-11 14:45:34,765 ERROR http.Http - at
> org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:218)
> 2011-07-11 14:45:34,765 ERROR http.Http - at
> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:158)
> 2011-07-11 14:45:34,765 ERROR http.Http - at
> org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
> 2011-07-11 14:45:34,765 ERROR http.Http - at
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
> 2011-07-11 14:45:34,765 ERROR http.Http - at
> org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:79)
> Both fetcher and indexing filter checker fail sometimes. I'm unsure whether
> this is something in Nutch or whether the remote server only returns
> content-length incidentally.