rusty x created NUTCH-2548:
------------------------------

             Summary: Compressed content skipped. Content of size 78 was 
truncated to 74
                 Key: NUTCH-2548
                 URL: https://issues.apache.org/jira/browse/NUTCH-2548
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.4
            Reporter: rusty x
         Attachments: nutch-content-truncated.patch

Gzip- or deflate-compressed content fails to parse, with a message like:

{{WARN  parse.ParserJob - https://rustyx.org/temp/index%20bbb skipped. Content 
of size 78 was truncated to 74}}

The root cause is that the original (compressed) Content-Length is stored in 
the headers, while the content itself is stored uncompressed. Consequently the 
Content-Length header doesn't match the stored content size.
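
To illustrate the mismatch, here is a minimal standalone sketch in plain Java. 
It is not Nutch code: the class and method names are hypothetical, and the 
truncation check is only assumed to compare the header value with the stored 
size, as the log message above suggests. For a very small body the gzip 
framing can make the compressed form larger than the original, which would 
explain a header value (78) larger than the stored content (74).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class LengthMismatchDemo {

    /** Gzip-compresses raw bytes, as a server using Content-Encoding: gzip would. */
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /**
     * Assumed shape of the truncation check: the content is treated as
     * truncated when the header claims more bytes than were stored.
     */
    public static boolean looksTruncated(int contentLengthHeader, int storedSize) {
        return contentLengthHeader > storedSize;
    }

    public static void main(String[] args) {
        byte[] body = "a short page body".getBytes(StandardCharsets.UTF_8);
        byte[] onTheWire = gzip(body);

        // The header describes the compressed bytes; storage holds the decoded ones.
        int contentLengthHeader = onTheWire.length;
        int storedSize = body.length;

        System.out.println("Content-Length header: " + contentLengthHeader);
        System.out.println("Stored content size:   " + storedSize);
        System.out.println("Flagged as truncated:  "
                + looksTruncated(contentLengthHeader, storedSize));
    }
}
```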

The attached patch fixes the issue by removing Content-Length from the 
headers when its value refers to the compressed size.
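
The idea can be sketched as follows. This is an illustration only, not the 
contents of nutch-content-truncated.patch: the class name, the method name and 
the use of a plain Map for the headers are assumptions, and the real change 
would operate on Nutch's own header storage.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class ContentLengthCleanup {

    /**
     * Hypothetical helper: drops Content-Length when the response was
     * gzip- or deflate-encoded, because the header then describes the
     * compressed bytes rather than the decoded content that gets stored.
     */
    public static void dropCompressedContentLength(Map<String, String> headers) {
        String encoding = headers.get("Content-Encoding");
        if (encoding == null) {
            return; // body was not compressed, header is still accurate
        }
        String normalized = encoding.trim().toLowerCase(Locale.ROOT);
        if (normalized.equals("gzip") || normalized.equals("x-gzip")
                || normalized.equals("deflate")) {
            headers.remove("Content-Length");
        }
    }

    public static void main(String[] args) {
        // Case-insensitive keys, since HTTP header names are case-insensitive.
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        headers.put("Content-Encoding", "gzip");
        headers.put("Content-Length", "78");

        dropCompressedContentLength(headers);
        System.out.println(headers.containsKey("Content-Length")); // false
    }
}
```

With the header removed, a later size check has nothing stale to compare 
against, so the decoded content is no longer reported as truncated.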



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
