[
https://issues.apache.org/jira/browse/NUTCH-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2716.
------------------------------------
Resolution: Fixed
Merged [PR #454|https://github.com/apache/nutch/pull/454]. Thanks, [~yossi]!
??Even when store.http.headers=true, the HTTP headers are not saved for a
gzipped or deflated response, because they may contain an incorrect
content-length header. This causes WARCExporter to generate "resource"
(header-less) entries instead of "response" entries. The correct behavior is to
store all the headers, and code that uses them should be aware and careful that
they represent the original headers, not the stored content.??
??This fixes protocol-http, protocol-selenium, and protocol-htmlunit to write
the raw response headers, and adds logic to WARCExporter and
CommonCrawlDataDumper to fix these headers.??
??It also fixes NUTCH-2715 (WARCExporter fails on large records), and upgrades
lib-htmlunit to use version 3.141.5 of Selenium, since Eclipse fails to compile
otherwise (conflicts with lib-selenium).??
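
To illustrate the kind of header fix-up an exporter needs once the raw headers are stored, here is a minimal sketch in Java. It is not the code from PR #454: the class name RawHeaderFixer and the method fixHeaders are hypothetical. It assumes the raw headers arrive as a single CRLF-separated string, that the stored content is already decompressed, and that only the Content-Encoding and Content-Length lines need adjusting.

```java
// Hypothetical helper for illustration only; not the actual Nutch implementation.
// It rewrites raw HTTP response headers so that they describe the stored
// (already decompressed) content rather than the bytes sent over the wire.
final class RawHeaderFixer {

  private static final String CRLF = "\r\n";

  /**
   * @param rawHeaders   status line and header lines as received from the server
   * @param storedLength length in bytes of the stored, decompressed content
   * @return the headers with Content-Encoding dropped and Content-Length
   *         replaced, terminated by an empty line
   */
  static String fixHeaders(String rawHeaders, int storedLength) {
    StringBuilder fixed = new StringBuilder();
    boolean lengthWritten = false;
    for (String line : rawHeaders.split("\r?\n")) {
      if (line.isEmpty()) {
        continue; // the terminating blank line is re-added at the end
      }
      String lower = line.toLowerCase(java.util.Locale.ROOT);
      if (lower.startsWith("content-encoding:")) {
        continue; // stored content is no longer gzipped or deflated
      }
      if (lower.startsWith("content-length:")) {
        if (!lengthWritten) {
          // replace the on-the-wire length with the stored length
          fixed.append("Content-Length: ").append(storedLength).append(CRLF);
          lengthWritten = true;
        }
        continue;
      }
      fixed.append(line).append(CRLF); // keep every other header unchanged
    }
    fixed.append(CRLF); // blank line that ends the header block
    return fixed.toString();
  }
}
```

Calling RawHeaderFixer.fixHeaders(rawHeaders, content.length) just before writing a "response" record would keep the remaining headers intact while making the two problematic ones consistent with the stored bytes. A header such as Transfer-Encoding may need similar treatment; that is left out of this sketch.
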
> protocol-http: Response headers are not stored for a compressed response
> ------------------------------------------------------------------------
>
> Key: NUTCH-2716
> URL: https://issues.apache.org/jira/browse/NUTCH-2716
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 1.15
> Reporter: Yossi Tamari
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
>
> Even when store.http.headers=true, the HTTP headers are not saved for a
> gzipped or deflated response, because they may contain an incorrect
> content-length header.
> This causes WARCExporter to generate "resource" (headerless) entries instead
> of "response" entries.
> While I can see why reporting the wrong content-encoding and length may be a
> bug, removing all the headers is not a fix.
> I am not submitting a patch yet since I'm not sure what the best fix is, but
> I guess the best patch is to remove those two header lines and store the rest
> of the headers. If there is no objection, I can submit a patch that does
> this. Otherwise, what would be a better fix?