[ 
https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2716.
------------------------------------
    Resolution: Fixed

Merged [PR #454|https://github.com/apache/nutch/pull/454]. Thanks, [~yossi]!

??Even when store.http.headers=true, the HTTP headers are not saved for a 
gzipped or deflated response, because they may contain an incorrect 
content-length header. This causes WARCExporter to generate "resource" 
(header-less) entries instead of "response" entries. The correct behavior is to 
store all the headers, and code that uses them should be aware and careful that 
they represent the original headers, not the stored content.??

??This fixes protocol-http, protocol-selenium, and protocol-htmlunit to write 
the raw response headers, and adds logic to WARCExporter and 
CommonCrawlDataDumper to fix these headers.??

??It also fixes NUTCH-2715 (WARCExporter fails on large records), and upgrades 
lib-htmlunit to use version 3.141.5 of Selenium, since Eclipse fails to compile 
otherwise (conflicts with lib-selenium).??
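
For readers unfamiliar with the problem, the header fix-up described above can be sketched roughly as follows. This is only an illustration and not the code merged in the PR: the class name StoredHeaderFixer and the method fixHeaders are hypothetical, and the sketch assumes the exporter holds the raw header block as a string and knows the byte length of the stored (already decompressed) content.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: adapts raw response headers so
// that they describe the stored (decoded) payload instead of the bytes
// that were sent over the wire.
final class StoredHeaderFixer {

  private StoredHeaderFixer() {}

  /**
   * Returns a header block in which Content-Encoding and the original
   * Content-Length are dropped and a Content-Length matching the stored
   * payload is appended. All other header lines are kept unchanged.
   */
  static String fixHeaders(String rawHeaders, long storedLength) {
    StringBuilder out = new StringBuilder();
    for (String line : rawHeaders.split("\r?\n")) {
      if (line.isEmpty()) {
        break; // blank line marks the end of the header block
      }
      String lower = line.toLowerCase(Locale.ROOT);
      // These two headers describe the compressed wire format and are
      // wrong for the decompressed content that was stored.
      if (lower.startsWith("content-encoding:")
          || lower.startsWith("content-length:")) {
        continue;
      }
      out.append(line).append("\r\n");
    }
    out.append("Content-Length: ").append(storedLength).append("\r\n\r\n");
    return out.toString();
  }
}
```

Calling StoredHeaderFixer.fixHeaders(raw, content.length) just before writing a "response" record would keep headers such as Content-Type and Last-Modified while avoiding a length mismatch with the stored bytes. The stored headers themselves stay untouched, so consumers can still see what the server originally sent.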

> protocol-http: Response headers are not stored for a compressed response
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2716
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2716
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.15
>            Reporter: Yossi Tamari
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.16
>
>
> Even when store.http.headers=true, the HTTP headers are not saved for a 
> gzipped or deflated response, because they may contain an incorrect 
> content-length header.
> This causes WARCExporter to generate "resource" (headerless) entries instead 
> of "response" entries.
> While I can see why reporting the wrong content-encoding and length may be a 
> bug, removing all the headers is not a fix.
> I am not submitting a patch yet since I'm not sure what the best fix is, but 
> I guess the best patch is to remove those two header lines and store the rest 
> of the headers. If there is no objection, I can submit a patch that does 
> this. Otherwise, what would be a better fix?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
