[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16488985#comment-16488985
 ] 

Sebastian Nagel edited comment on NUTCH-2557 at 5/24/18 1:32 PM:
-----------------------------------------------------------------

See [comments in 
NUTCH-2549|https://issues.apache.org/jira/browse/NUTCH-2549?focusedCommentId=16430591&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16430591]:
 by default, content from redirects and 404s should be ignored, but it should be 
possible to optionally fetch and store the content (e.g. by adding a property 
{{http.content.store.404}}).


was (Author: wastl-nagel):
See [comments in 
NUTCH-2549|https://issues.apache.org/jira/browse/NUTCH-2549?focusedCommentId=16430591&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16430591]:
 please try to make this the default, but allow optionally fetching and storing 
the content.

> protocol-http fails to follow redirections when an HTTP response body is 
> invalid
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2557
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2557
>             Project: Nutch
>          Issue Type: Sub-task
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> If a server sends a redirection (3XX status code, with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket 
> as soon as they can.
>  * Example: this page redirects to its HTTPS version, with an HTTP 
> body containing invalid gzip-encoded content. Browsers follow the 
> redirection, but Nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it should 
> at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers, and not even try 
> parsing the body when the headers indicate a redirection.
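
The two behaviors proposed in the description can be sketched roughly as 
follows. This is only an illustration, not the protocol-http code: the class 
RedirectAwareBody and its methods are hypothetical names, and the sketch 
assumes the body is gzip-encoded and already fully read into a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper for illustration; not part of Nutch's protocol-http. */
final class RedirectAwareBody {

  /** True for any 3XX status code. */
  static boolean isRedirect(int statusCode) {
    return statusCode >= 300 && statusCode < 400;
  }

  /**
   * Returns the decoded body, or null when the body is skipped (3XX with a
   * Location header) or cannot be decoded.
   */
  static byte[] decodeBody(int statusCode, String location, byte[] gzippedBody) {
    if (isRedirect(statusCode) && location != null) {
      // Browser-like behavior: do not even try to parse the body.
      return null;
    }
    try (GZIPInputStream in =
             new GZIPInputStream(new ByteArrayInputStream(gzippedBody));
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      // Fallback: an invalid body yields null instead of failing the fetch,
      // so the status code and headers remain usable.
      return null;
    }
  }
}
```

With this shape, a caller that receives null for a 3XX response can still 
read the Location header and follow the redirection.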



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
