[
https://issues.apache.org/jira/browse/NUTCH-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811949#comment-16811949
]
Sebastian Nagel commented on NUTCH-2707:
----------------------------------------
It turns out that there are a few more servers which do not conform to the
standard and answer a request with {{Accept-Encoding: gzip}} with something
other than {{Content-Encoding: gzip}} or {{Content-Encoding: identity}}. We
should at least try to handle most of these cases. Further examples:
- same as the initial problem:
{noformat}
% nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
-Dstore.http.headers=true -Dstore.http.request=true \
https://saudibusiness.directory/%D8%A7%D8%AB%D8%A7%D8%AB%D9%83%D9%88%D9%85-5741.html
...
contentType: application/gzip
...
Content Metadata: ... _request_=GET
/%D8%A7%D8%AB%D8%A7%D8%AB%D9%83%D9%88%D9%85-5741.html HTTP/1.1
...
Accept-Encoding: gzip
_response.headers_=HTTP/1.1 200 OK
Date: Sun, 07 Apr 2019 14:19:15 GMT
Server: Apache
X-Powered-By: PHP/5.6.30
Content-Encoding: zlib,gzip,deflate
...
{noformat}
- response uses "deflate" although "gzip" is requested:
{noformat}
% bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
-Dstore.http.headers=true -Dstore.http.request=true \
https://de.wantedly.com/all/japan/designer/businessmodel
...
contentType: application/zlib
...
Content Metadata: ... _request_=GET /all/japan/designer/businessmodel HTTP/1.1
...
Accept-Encoding: gzip
... _response.headers_=HTTP/1.1 200 OK
Date: Fri, 05 Apr 2019 15:56:01 GMT
...
Server: nginx
...
Content-Encoding: deflate
...
{noformat}
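Since the header cannot be trusted in these cases, one way to handle them is to look at the first bytes of the body instead. Below is a minimal sketch of that idea, assuming the raw (still compressed) body is available as a byte array. The class {{LenientContentDecoder}} and its methods are hypothetical names used only for illustration, they are not part of Nutch or okhttp, and the magic-byte check is a heuristic:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper for illustration, not part of Nutch or okhttp. */
class LenientContentDecoder {

  /** Compression format guessed from the first two bytes of the body. */
  enum Format { GZIP, ZLIB, NONE }

  /** Guess the format from magic bytes, ignoring the Content-Encoding header. */
  static Format sniff(byte[] body) {
    if (body.length >= 2) {
      int b0 = body[0] & 0xff;
      int b1 = body[1] & 0xff;
      // gzip members start with 0x1f 0x8b (RFC 1952)
      if (b0 == 0x1f && b1 == 0x8b) {
        return Format.GZIP;
      }
      // zlib header (RFC 1950): method 8 (deflate), window size <= 32k,
      // no preset dictionary, 16-bit header value is a multiple of 31
      if ((b0 & 0x0f) == 8 && (b0 >> 4) <= 7 && (b1 & 0x20) == 0
          && ((b0 << 8) | b1) % 31 == 0) {
        return Format.ZLIB;
      }
    }
    return Format.NONE;
  }

  /** Decompress if the body looks compressed, otherwise return it unchanged. */
  static byte[] decode(byte[] body) {
    try {
      switch (sniff(body)) {
        case GZIP:
          return readAll(new GZIPInputStream(new ByteArrayInputStream(body)));
        case ZLIB:
          return readAll(new InflaterInputStream(new ByteArrayInputStream(body)));
        default:
          return body;
      }
    } catch (IOException e) {
      // false positive of the heuristic or a broken stream: keep the raw bytes
      return body;
    }
  }

  private static byte[] readAll(InputStream in) throws IOException {
    try (InputStream stream = in) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[8192];
      int n;
      while ((n = stream.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }
}
```

With something like this, both examples above would be decoded from the body alone: the first as gzip (contentType: application/gzip) and the second as zlib (contentType: application/zlib), whatever the {{Content-Encoding}} header claims.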
> protocol-okhttp fails to decompress content if Content-Encoding header is
> wrong
> -------------------------------------------------------------------------------
>
> Key: NUTCH-2707
> URL: https://issues.apache.org/jira/browse/NUTCH-2707
> Project: Nutch
> Issue Type: Bug
> Components: plugin, protocol
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.16
>
>
> The plugin protocol-okhttp does not decompress the returned gzipped content
> for some rare pages. It looks like this happens because the response HTTP
> header does not specify {{Content-Encoding: gzip}} but
> {{Content-Encoding: zlib,gzip,deflate}}.
> {noformat}
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
> -Dstore.http.headers=true -Dstore.http.request=true \
> http://24310.gr/afroditi-42426.html
> fetching: http://24310.gr/afroditi-42426.html
> ...
> contentType: application/gzip
> ...
> Content Metadata: Transfer-Encoding=chunked ...
> Content-Encoding=zlib,gzip,deflate ... _request_=GET /afroditi-42426.html
> HTTP/1.1
> ...
> Accept-Encoding: gzip
> _response.headers_=HTTP/1.1 200 OK
> ...
> Content-Encoding: zlib,gzip,deflate
> ...
> Transfer-Encoding: chunked
> Connection: keep-alive
> {noformat}
> The plugin protocol-http requests {{Accept-Encoding: x-gzip, gzip, deflate}}
> and gets the correct response header:
> {noformat}
> % bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' \
> -Dstore.http.headers=true -Dstore.http.request=true \
> http://24310.gr/afroditi-42426.html
> ...
> contentType: application/xhtml+xml
> ...
> Content Metadata: ... Content-Encoding=gzip ... _request_=GET
> /afroditi-42426.html HTTP/1.1
> Host: 24310.gr
> Accept-Encoding: x-gzip, gzip, deflate
> ...
> {noformat}
> The same holds for Firefox, which sends {{Accept-Encoding: gzip, deflate}}.
> I will report the issue upstream to okhttp. But it would also be possible to
> handle the content encoding in the protocol implementation: if the
> Accept-Encoding header is set, the okhttp library will not decompress the
> content and expects the calling code to handle it.
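If the content encoding is handled in the calling code as suggested in the description, a first step could be to check whether the {{Content-Encoding}} of the response is one the request actually offered. The following is only a rough sketch of such a check: {{ContentEncodingHeader}}, {{tokens}} and {{matchesRequest}} are made-up names, and treating {{x-gzip}} as an alias of {{gzip}} is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration, not part of Nutch or okhttp. */
class ContentEncodingHeader {

  /**
   * Split an Accept-Encoding or Content-Encoding value into lower-cased
   * tokens. Quality values (";q=...") are dropped and "x-gzip" is mapped to
   * "gzip" (assumption of this sketch).
   */
  static List<String> tokens(String headerValue) {
    List<String> result = new ArrayList<>();
    if (headerValue == null) {
      return result;
    }
    for (String part : headerValue.split(",")) {
      String token = part;
      int semicolon = token.indexOf(';');
      if (semicolon >= 0) {
        token = token.substring(0, semicolon);
      }
      token = token.trim().toLowerCase(Locale.ROOT);
      if (token.isEmpty()) {
        continue;
      }
      result.add(token.equals("x-gzip") ? "gzip" : token);
    }
    return result;
  }

  /**
   * True if every encoding named in the response was offered in the request.
   * "identity" and a missing Content-Encoding header are always accepted.
   */
  static boolean matchesRequest(String acceptEncoding, String contentEncoding) {
    List<String> offered = tokens(acceptEncoding);
    for (String token : tokens(contentEncoding)) {
      if (!token.equals("identity") && !offered.contains(token)) {
        return false;
      }
    }
    return true;
  }
}
```

For the page from the description, {{matchesRequest("gzip", "zlib,gzip,deflate")}} is false, which the calling code could take as a signal to inspect the body bytes instead of trusting the header. For the protocol-http request, {{matchesRequest("x-gzip, gzip, deflate", "gzip")}} is true.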