Sebastian Nagel created NUTCH-2707:

             Summary: protocol-okhttp fails to decompress gzip-encoded content
                 Key: NUTCH-2707
             Project: Nutch
          Issue Type: Bug
          Components: plugin, protocol
    Affects Versions: 1.15
            Reporter: Sebastian Nagel
             Fix For: 1.16

The plugin protocol-okhttp does not decompress the returned gzipped content for 
some rare pages.  This appears to happen because the HTTP response header does 
not specify {{Content-Encoding: gzip}} but {{Content-Encoding: zlib,gzip,deflate}}.
% bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
      -Dstore.http.headers=true -Dstore.http.request=true \
contentType: application/gzip
Content Metadata: Transfer-Encoding=chunked ... 
Content-Encoding=zlib,gzip,deflate ... _request_=GET /afroditi-42426.html 
Accept-Encoding: gzip

 _response.headers_=HTTP/1.1 200 OK
Content-Encoding: zlib,gzip,deflate
Transfer-Encoding: chunked
Connection: keep-alive

The plugin protocol-http requests {{Accept-Encoding: x-gzip, gzip, deflate}} 
and gets the correct response header:
% bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' \
       -Dstore.http.headers=true -Dstore.http.request=true
contentType: application/xhtml+xml
Content Metadata: ... Content-Encoding=gzip ... _request_=GET 
/afroditi-42426.html HTTP/1.1
Accept-Encoding: x-gzip, gzip, deflate

Firefox behaves similarly: it sends {{Accept-Encoding: gzip, deflate}}.

I will report the issue upstream to okhttp. It would also be possible to 
handle the content encoding in the protocol implementation: if the 
Accept-Encoding header is set by the caller, the okhttp library does not 
decompress the content and expects the calling code to handle it.
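
A minimal sketch of what handling the encoding in the calling code could look 
like, using only java.util.zip. The class and method names (ContentDecoder, 
mentionsGzip, looksGzipped, decode) are hypothetical and not part of Nutch or 
okhttp. It assumes that a server announcing {{zlib,gzip,deflate}} actually 
sends a single gzip stream, which this report does not confirm, so the body is 
only decompressed if it also starts with the gzip magic bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of Nutch or okhttp.
final class ContentDecoder {

  private ContentDecoder() {}

  /** True if the body starts with the gzip magic bytes 0x1f 0x8b. */
  static boolean looksGzipped(byte[] body) {
    return body != null && body.length >= 2
        && (body[0] & 0xff) == 0x1f && (body[1] & 0xff) == 0x8b;
  }

  /** True if "gzip" or "x-gzip" is one of the comma-separated header tokens. */
  static boolean mentionsGzip(String contentEncoding) {
    if (contentEncoding == null) {
      return false;
    }
    for (String token : contentEncoding.split(",")) {
      String t = token.trim().toLowerCase(Locale.ROOT);
      if (t.equals("gzip") || t.equals("x-gzip")) {
        return true;
      }
    }
    return false;
  }

  /**
   * Returns the gunzipped body if the header mentions gzip and the body
   * looks like a gzip stream; otherwise returns the body unchanged.
   * Assumption: a multi-valued header such as "zlib,gzip,deflate" is
   * treated as a single gzip encoding, not as a chain of encodings.
   */
  static byte[] decode(String contentEncoding, byte[] body) throws IOException {
    if (!mentionsGzip(contentEncoding) || !looksGzipped(body)) {
      return body;
    }
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body));
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    }
  }
}
```

Checking the magic bytes in addition to the header keeps the helper from 
failing on bodies that were never compressed, at the cost of not handling 
zlib or raw deflate streams.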

