Sebastian Nagel created NUTCH-2707:
--------------------------------------

             Summary: protocol-okhttp fails to decompress gzip-encoded content
                 Key: NUTCH-2707
                 URL: https://issues.apache.org/jira/browse/NUTCH-2707
             Project: Nutch
          Issue Type: Bug
          Components: plugin, protocol
    Affects Versions: 1.15
            Reporter: Sebastian Nagel
             Fix For: 1.16


The plugin protocol-okhttp does not decompress the returned gzipped content for 
some rare pages. This appears to happen because the HTTP response header does 
not specify {{Content-Encoding: gzip}} but {{Content-Encoding: zlib,gzip,deflate}}.
{noformat}
% bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
      -Dstore.http.headers=true -Dstore.http.request=true \
      http://24310.gr/afroditi-42426.html
fetching: http://24310.gr/afroditi-42426.html 
...
contentType: application/gzip
...
Content Metadata: Transfer-Encoding=chunked ... Content-Encoding=zlib,gzip,deflate ...
_request_=GET /afroditi-42426.html HTTP/1.1
...
Accept-Encoding: gzip

 _response.headers_=HTTP/1.1 200 OK
...
Content-Encoding: zlib,gzip,deflate
...
Transfer-Encoding: chunked
Connection: keep-alive
{noformat}

The plugin protocol-http requests {{Accept-Encoding: x-gzip, gzip, deflate}} 
and gets the correct response header:
{noformat}
% bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' \
       -Dstore.http.headers=true -Dstore.http.request=true \
       http://24310.gr/afroditi-42426.html
...
contentType: application/xhtml+xml
...
Content Metadata: ... Content-Encoding=gzip ...
_request_=GET /afroditi-42426.html HTTP/1.1
Host: 24310.gr
Accept-Encoding: x-gzip, gzip, deflate
...
{noformat}

The same holds for Firefox, which sends {{Accept-Encoding: gzip, deflate}}.

I will report the issue upstream to okhttp. It would also be possible to 
handle the content encoding in the protocol implementation: if the 
Accept-Encoding header is set explicitly by the caller, the okhttp library 
does not decompress the content and expects the calling code to handle it.
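
Handling the content encoding in the calling code could look roughly like the following sketch. It is not the actual fix and not part of Nutch or okhttp: the class {{GzipBodyDecoder}} and its method names are hypothetical, and it uses only {{java.util.zip}}. It assumes the response body is already available as a byte array and that checking for the gzip magic bytes is acceptable instead of trusting a Content-Encoding value such as {{zlib,gzip,deflate}}.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for illustration only; not part of Nutch or okhttp. */
final class GzipBodyDecoder {

  private GzipBodyDecoder() {}

  /** Returns true if the body starts with the gzip magic bytes 0x1f 0x8b. */
  static boolean looksLikeGzip(byte[] body) {
    return body != null && body.length >= 2
        && (body[0] & 0xff) == 0x1f && (body[1] & 0xff) == 0x8b;
  }

  /**
   * Decompresses the body if it looks like gzip, otherwise returns it unchanged.
   * Assumption: the content is sniffed rather than derived from the
   * Content-Encoding header, which may be unusual (e.g. "zlib,gzip,deflate").
   */
  static byte[] decodeIfGzip(byte[] body) {
    if (!looksLikeGzip(body)) {
      return body;
    }
    try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body));
         ByteArrayOutputStream out = new ByteArrayOutputStream()) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Compresses bytes with gzip; only needed to try out the decoder. */
  static byte[] gzip(byte[] plain) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(plain);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }
}
```

A caller would pass the raw response bytes through {{GzipBodyDecoder.decodeIfGzip(...)}} before handing them to the parser. A real implementation would probably also need to limit the decompressed size and handle deflate.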




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
