[ https://issues.apache.org/jira/browse/NUTCH-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430302#comment-16430302 ]
Sebastian Nagel commented on NUTCH-2548:
----------------------------------------
Thanks, [~rustyx]! Confirmed for 2.x (using parsechecker); 1.x does not seem
to be affected.
> Compressed content skipped. Content of size 78 was truncated to 74
> ------------------------------------------------------------------
>
> Key: NUTCH-2548
> URL: https://issues.apache.org/jira/browse/NUTCH-2548
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.4
> Reporter: Rustam
> Priority: Major
> Attachments: nutch-content-truncated.patch
>
>
> gzip or deflate compressed content fails to parse with a message like:
> {{WARN parse.ParserJob - https://rustyx.org/temp/index%20bbb skipped.
> Content of size 78 was truncated to 74}}
> The root cause is that the original (compressed) Content-Length is stored in
> the headers, while the content is stored uncompressed. As a result, the
> Content-Length doesn't match the stored content size.
> See the attached patch, which fixes the issue by removing Content-Length from
> the headers if it refers to the compressed size.