[jira] [Commented] (NUTCH-2548) Compressed content skipped. Content of size 78 was truncated to 74

Sebastian Nagel (JIRA) Mon, 09 Apr 2018 02:23:44 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430302#comment-16430302 ]


Sebastian Nagel commented on NUTCH-2548:
----------------------------------------

Thanks, [~rustyx]! Confirmed for 2.x (using parsechecker); 1.x does not seem 
to be affected.

> Compressed content skipped. Content of size 78 was truncated to 74
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2548
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2548
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.4
>            Reporter: Rustam
>            Priority: Major
>         Attachments: nutch-content-truncated.patch
>
>
> Gzip- or deflate-compressed content fails to parse with a message like:
> {{WARN  parse.ParserJob - https://rustyx.org/temp/index%20bbb skipped. 
> Content of size 78 was truncated to 74}}
> The root cause is that the original (compressed) Content-Length is stored in 
> the headers, while the content is stored uncompressed. As a result, the 
> Content-Length doesn't match the stored content size.
> The attached patch fixes the issue by removing Content-Length from the 
> headers if it holds the compressed size.
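
The mismatch described above can be reproduced outside Nutch. The following is only a minimal sketch, not the attached patch and not Nutch code: the class ContentLengthSketch and the helpers headersForDecodedBody and looksTruncated are hypothetical names, and the truncation check is a simplified stand-in for the comparison implied by the warning (header length larger than stored size). It assumes the fix amounts to dropping Content-Length whenever Content-Encoding says the body was compressed on the wire but stored decoded.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ContentLengthSketch {

    /** Compress bytes with gzip, as a server would before sending. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Decompress gzip bytes, as a fetcher would before storing the content. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Hypothetical fix: drop Content-Length when the body is stored decoded. */
    static Map<String, String> headersForDecodedBody(Map<String, String> headers) {
        Map<String, String> out = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        out.putAll(headers);
        String enc = out.get("Content-Encoding");
        if (enc != null) {
            String e = enc.trim().toLowerCase(Locale.ROOT);
            if (e.equals("gzip") || e.equals("x-gzip") || e.equals("deflate")) {
                out.remove("Content-Length"); // value described the compressed bytes
            }
        }
        return out;
    }

    /** Simplified check: header claims more bytes than were actually stored. */
    static boolean looksTruncated(Map<String, String> headers, byte[] stored) {
        String len = headers.get("Content-Length");
        if (len == null) {
            return false;
        }
        try {
            return Integer.parseInt(len.trim()) > stored.length;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] raw = "hello nutch".getBytes(StandardCharsets.UTF_8);
        byte[] wire = gzip(raw);      // what the server sent
        byte[] stored = gunzip(wire); // what ends up in the store

        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        headers.put("Content-Encoding", "gzip");
        headers.put("Content-Length", Integer.toString(wire.length));

        System.out.println("wire=" + wire.length + " stored=" + stored.length);
        System.out.println("before fix: truncated=" + looksTruncated(headers, stored));
        System.out.println("after fix:  truncated="
            + looksTruncated(headersForDecodedBody(headers), stored));
    }
}
```

For a body this short, the gzip framing makes the compressed form larger than the decoded one, which is the same situation as in the reported warning (header size 78, stored size 74).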



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
