patrick peck created NUTCH-2067:
-----------------------------------
Summary: HttpFormAuthentication unable to decode login page when
server responds with GZIP encoding
Key: NUTCH-2067
URL: https://issues.apache.org/jira/browse/NUTCH-2067
Project: Nutch
Issue Type: Bug
Components: plugin, protocol
Affects Versions: 1.10
Reporter: patrick peck
The method
org.apache.nutch.protocol.httpclient.HttpFormAuthentication#httpGetPageContent(),
which is used to download the login page when doing form authentication, does
not take into account that the response body may be gzip-encoded. This can
happen because the Http.configureClient() method sets the Accept-Encoding
header to "x-gzip, gzip, deflate".
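For illustration, here is a minimal sketch of how a response body could be
decoded according to its Content-Encoding header before the login form is
parsed. It is not the actual Nutch code or a proposed patch: the class
LoginPageDecoder and its methods are hypothetical names, and only standard
java.util.zip classes are used. It also assumes the page is UTF-8 and that
"deflate" bodies are zlib-wrapped.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of Nutch: decodes a login page body
// based on the value of the Content-Encoding response header.
final class LoginPageDecoder {

    // Wraps the raw stream in a decompressing stream when the server
    // reports a compressed body; otherwise returns the stream unchanged.
    public static InputStream decodingStream(InputStream raw, String contentEncoding)
            throws IOException {
        if (contentEncoding == null) {
            return raw;
        }
        String enc = contentEncoding.trim().toLowerCase(Locale.ROOT);
        if (enc.equals("gzip") || enc.equals("x-gzip")) {
            return new GZIPInputStream(raw);
        }
        if (enc.equals("deflate")) {
            // Assumption: zlib-wrapped deflate data. Servers that send
            // raw deflate data would need new Inflater(true) instead.
            return new InflaterInputStream(raw);
        }
        return raw; // "identity" or unknown: pass through as-is
    }

    // Reads the whole (possibly compressed) body into a string.
    // Assumption: the page is UTF-8 encoded.
    public static String readBody(InputStream raw, String contentEncoding)
            throws IOException {
        try (InputStream in = decodingStream(raw, contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Convenience wrapper for in-memory bodies.
    public static String decode(byte[] body, String contentEncoding) {
        try {
            return readBody(new ByteArrayInputStream(body), contentEncoding);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: gzip-compresses a string to simulate a server response.
    public static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String page = "<html><form action=\"/login\"></form></html>";
        // Simulate a gzip-encoded login page and decode it again.
        System.out.println(decode(gzip(page), "gzip"));
    }
}
```

In httpGetPageContent() the equivalent step would be to check the response's
Content-Encoding header and wrap the body stream accordingly before reading
the form fields.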
It is also not possible to override the Accept-Encoding header through the
configuration. More precisely, if you add
<additionalPostHeaders>
<field name="Accept-Encoding" value="identity" />
</additionalPostHeaders>
to the configuration, the HTTP client sends the Accept-Encoding header twice:
first with the configured value, then with the default value.