[ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215530#comment-13215530 ]
Ferdy Galema commented on NUTCH-965:
------------------------------------
Hi Markus,
For nutchtrunk I ran the following test crawls, and they all worked as expected
(for URLs that are NOT truncated):
- fetching and separate parsing (parser.skip.truncated set to true)
- fetching with parsing (parser.skip.truncated set to true)
- fetching and separate parsing (parser.skip.truncated set to false)
- fetching with parsing (parser.skip.truncated set to false)
I did the same for nutchgora. This verifies that for non-truncated URLs
everything works as before.
For URLs that _are_ truncated, I debugged a crawl and artificially changed the
size to check that parsing is skipped, but only when parser.skip.truncated
is set to true. This works too.
In short, yes it has been fixed.
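
To make the truncated case above concrete, here is a minimal sketch of the kind of check being exercised. It assumes truncation is detected by comparing the size declared in the Content-Length header with the number of bytes actually stored. The class and method names (TruncationCheck, isTruncated, shouldSkipParsing) are hypothetical and are not taken from the attached patches; only the parser.skip.truncated setting comes from this issue.

```java
public class TruncationCheck {

    /**
     * Hypothetical helper: returns true when fewer bytes were stored than
     * the server declared in its Content-Length header.
     */
    static boolean isTruncated(String contentLengthHeader, int actualLength) {
        if (contentLengthHeader == null) {
            // No declared size, so truncation cannot be detected this way.
            return false;
        }
        long declared;
        try {
            declared = Long.parseLong(contentLengthHeader.trim());
        } catch (NumberFormatException e) {
            // Unparsable header: assume the document is not truncated.
            return false;
        }
        return actualLength < declared;
    }

    /**
     * Parsing is skipped only when the flag (standing in for
     * parser.skip.truncated) is true AND the document looks truncated.
     */
    static boolean shouldSkipParsing(boolean skipTruncated,
                                     String contentLengthHeader,
                                     int actualLength) {
        return skipTruncated && isTruncated(contentLengthHeader, actualLength);
    }

    public static void main(String[] args) {
        byte[] fetched = new byte[65536]; // pretend the fetcher cut the file here

        // Declared 1 MiB, stored 64 KiB, flag on: parsing is skipped.
        System.out.println(shouldSkipParsing(true, "1048576", fetched.length));  // true
        // Same document, flag off: parsing proceeds as before.
        System.out.println(shouldSkipParsing(false, "1048576", fetched.length)); // false
        // Complete document, flag on: parsing proceeds.
        System.out.println(shouldSkipParsing(true, "65536", fetched.length));    // false
    }
}
```

The three calls in main mirror the test matrix above: a truncated document is skipped only with the flag set to true, and non-truncated documents are parsed regardless of the flag.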
> Skip parsing for truncated documents
> ------------------------------------
>
> Key: NUTCH-965
> URL: https://issues.apache.org/jira/browse/NUTCH-965
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: Alexis
> Assignee: Lewis John McGibbney
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-965-v2.patch, NUTCH-965-v3-nutchgora.txt,
> NUTCH-965-v3-trunk.txt, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is
> described here:
> http://www.mail-archive.com/[email protected]/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters corrupted
> data caused by, for example, truncating big binary files at fetch time.