rusty x created NUTCH-2548:
------------------------------
Summary: Compressed content skipped. Content of size 78 was
truncated to 74
Key: NUTCH-2548
URL: https://issues.apache.org/jira/browse/NUTCH-2548
Project: Nutch
Issue Type: Bug
Affects Versions: 2.4
Reporter: rusty x
Attachments: nutch-content-truncated.patch
Gzip- or deflate-compressed content fails to parse, with a message like:
{{WARN parse.ParserJob - https://rustyx.org/temp/index%20bbb skipped. Content
of size 78 was truncated to 74}}
The root cause is that the original (compressed) Content-Length is stored in
the headers, while the content is stored uncompressed. As a result, the
Content-Length doesn't match the stored content size, and the parser treats
the content as truncated.
See the attached patch, which fixes the issue by removing Content-Length from
the headers when it holds the compressed size.
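
For illustration only, here is a minimal standalone sketch of the mismatch and
of the kind of fix described above. It is not the attached patch and does not
use Nutch classes: the class name ContentLengthSketch and the methods
headersForStoredContent and looksTruncated are hypothetical, and the
truncation check is a simplified stand-in for whatever the parser actually
does. The sizes 78 and 74 are taken from the log message.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class ContentLengthSketch {

    // Hypothetical helper: copy the response headers and drop Content-Length
    // when the body arrived compressed, because the header then describes the
    // compressed size rather than the size of the stored (decompressed) content.
    public static Map<String, String> headersForStoredContent(Map<String, String> responseHeaders) {
        // HTTP header names are case-insensitive, so use a case-insensitive map.
        Map<String, String> stored = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        stored.putAll(responseHeaders);
        String encoding = stored.get("Content-Encoding");
        if (encoding != null) {
            String e = encoding.trim().toLowerCase(Locale.ROOT);
            if (e.equals("gzip") || e.equals("x-gzip") || e.equals("deflate")) {
                stored.remove("Content-Length");
            }
        }
        return stored;
    }

    // Simplified stand-in for a truncation check: the content is considered
    // truncated when fewer bytes are stored than Content-Length announces.
    public static boolean looksTruncated(Map<String, String> headers, byte[] storedContent) {
        String declared = null;
        for (Map.Entry<String, String> h : headers.entrySet()) {
            if (h.getKey().equalsIgnoreCase("Content-Length")) {
                declared = h.getValue();
            }
        }
        if (declared == null) {
            return false; // nothing to compare against
        }
        try {
            return storedContent.length < Integer.parseInt(declared.trim());
        } catch (NumberFormatException ex) {
            return false; // unparsable header: do not flag the content
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        headers.put("Content-Encoding", "gzip");
        headers.put("Content-Length", "78");   // size on the wire (compressed)
        byte[] stored = new byte[74];           // size after decompression

        System.out.println("before: truncated = " + looksTruncated(headers, stored));
        System.out.println("after:  truncated = "
                + looksTruncated(headersForStoredContent(headers), stored));
    }
}
```

Note that for very short bodies the gzip framing can make the compressed form
larger than the original, which is how a 74-byte page can carry a
Content-Length of 78 and be reported as truncated.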