[
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508019#comment-16508019
]
Sebastian Nagel commented on NUTCH-2557:
----------------------------------------
Hi [~omkar20895], hi [~gbouchar], [PR
#347|https://github.com/apache/nutch/pull/347] contains Gerard's solution for
this issue, see [commit
d163512|https://github.com/apache/nutch/pull/347/commits/d163512d5d2e345dfe6c816a29dc93a108dfd254].
It still reads the payload content for redirects and other non-200
responses, but if reading the payload throws an exception, the exception is
caught and ignored. Since this only affects responses which would fail otherwise,
I've decided not to introduce a new property. Let me know whether this is ok.
Thanks!
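
For illustration only, a minimal Java sketch of the behavior described above. This is not the code from the commit: the class LenientBodyReader and its methods readGzipBody and readContent are hypothetical names, and the sketch assumes that a decoding error is ignored only when the status code is not 200, so that a redirect with a broken body can still be followed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch, not the actual protocol-http implementation.
final class LenientBodyReader {

  // Decodes a gzip-encoded body; throws IOException on invalid gzip data.
  static byte[] readGzipBody(InputStream in) throws IOException {
    try (GZIPInputStream gz = new GZIPInputStream(in);
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = gz.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }

  // Always tries to read the body. For non-200 responses (assumption:
  // redirects and errors) a failure is swallowed and null is returned,
  // so the status code and headers remain usable. For 200 responses the
  // exception is rethrown, as before.
  static byte[] readContent(int statusCode, InputStream in) throws IOException {
    try {
      return readGzipBody(in);
    } catch (IOException e) {
      if (statusCode == 200) {
        throw e;
      }
      return null; // content is optional for non-200 responses
    }
  }
}
```

With this shape, a 302 response whose body is not valid gzip yields null content instead of an exception, while the same broken body on a 200 response still fails.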
> protocol-http fails to follow redirections when an HTTP response body is
> invalid
> --------------------------------------------------------------------------------
>
> Key: NUTCH-2557
> URL: https://issues.apache.org/jira/browse/NUTCH-2557
> Project: Nutch
> Issue Type: Sub-task
> Affects Versions: 1.14
> Reporter: Gerard Bouchar
> Priority: Major
> Fix For: 1.15
>
>
> If a server sends a redirection (3XX status code, with a Location header),
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error
> occurs while decoding the body, the redirection is not followed and the
> information is lost. Browsers follow the redirection and close the socket
> as soon as they can.
> * Example: this page is a redirection to its https version, with an HTTP
> body containing invalid gzip-encoded content. Browsers follow the
> redirection, but Nutch throws an error:
> ** [http://www.webarcelona.net/es/blog?page=2]
>
> The HttpResponse::getContent method can already return null. I think it should
> at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers, and not even try
> parsing the body when the headers indicate a redirection.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)