[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508019#comment-16508019
 ] 

Sebastian Nagel commented on NUTCH-2557:
----------------------------------------

Hi [~omkar20895], hi [~gbouchar],
[PR #347|https://github.com/apache/nutch/pull/347] contains Gerard's solution for 
this issue, see 
[commit d163512|https://github.com/apache/nutch/pull/347/commits/d163512d5d2e345dfe6c816a29dc93a108dfd254].
It does not skip reading the payload content for redirects and other non-200 
responses. Instead, if reading the payload throws an exception, the exception is 
caught and ignored. Since this only affects responses which would otherwise fail, 
I've decided not to introduce a new property. Let me know whether this is ok. 
Thanks!
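
To illustrate the behavior described above, here is a minimal sketch. It is not the code from PR #347: the class LenientBodyReader and the methods gunzip and readContent are hypothetical names, and the sketch assumes the body is gzip-encoded and already fully read into a byte array. The body is always decoded, but a decoding failure is only propagated for a 200 response; for redirects and other non-200 responses the exception is swallowed and null is returned, so the status code and headers remain usable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper for illustration only; not the actual Nutch HttpResponse code. */
class LenientBodyReader {

  /** Decodes a gzip-encoded body; throws IOException if the data is not valid gzip. */
  static byte[] gunzip(byte[] raw) throws IOException {
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(raw));
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }

  /**
   * Always attempts to decode the body. A failure is rethrown for a 200
   * response, but ignored for any other status (e.g. a 3xx redirect),
   * in which case null is returned instead of the content.
   */
  static byte[] readContent(int statusCode, byte[] rawBody) throws IOException {
    try {
      return gunzip(rawBody);
    } catch (IOException e) {
      if (statusCode == 200) {
        throw e; // a broken body on a successful response is still an error
      }
      return null; // non-200: drop the body, keep status and headers
    }
  }
}
```

With this shape, a caller that sees a 301 or 302 can still read the Location header and follow the redirect even when the body is invalid, which is why no new configuration property is needed.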

> protocol-http fails to follow redirections when an HTTP response body is 
> invalid
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2557
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2557
>             Project: Nutch
>          Issue Type: Sub-task
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.15
>
>
> If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket 
> as soon as they can.
>  * Example: this page is a redirection to its https version, with an HTTP 
> body containing invalid gzip-encoded content. Browsers follow the 
> redirection, but Nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it should 
> at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers, and not even try 
> parsing the body when the headers indicate a redirection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
