[ https://issues.apache.org/jira/browse/NUTCH-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-1308:
-----------------------------------
    Fix Version/s:     (was: 2.5)
                       1.13

> Add main() to ZipParser
> -----------------------
>
>                 Key: NUTCH-1308
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1308
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4, nutchgora
>            Reporter: Lewis John McGibbney
>            Assignee: Sebastian Nagel
>             Fix For: 1.13
>
>         Attachments: NUTCH-1308-ZipParser-main-trunk.patch
>
>
> Two issues here...
> 1) Recently ferdy committed NUTCH-965, which skips parsing of truncated
> documents. parse-zip has its own implementation of the same check when it
> should really draw on the aforementioned implementation.
> 2) If truncation occurs in the offending piece of code mentioned above, we
> get the incorrect log message "Parser can't handle incomplete pdf file."
> This message is wrong for a zip parser and should be removed.
> {code}
> if (contentLen != null && contentInBytes.length != len) {
>   return new ParseStatus(ParseStatus.FAILED,
>       ParseStatus.FAILED_TRUNCATED, "Content truncated at "
>       + contentInBytes.length
>       + " bytes. Parser can't handle incomplete pdf file.")
>       .getEmptyParseResult(content.getUrl(), getConf());
> }
> {code}
> For clarity, the issue is present in both the Nutchgora branch [1] and Nutch
> trunk [2].
> [1]
> https://svn.apache.org/viewvc/nutch/branches/nutchgora/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java?diff_format=h&view=markup
> [2]
> https://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java?diff_format=h&view=markup
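
To illustrate the two points in the description, here is a minimal standalone sketch of a truncation check that lives in one shared place and reports a message that does not name the wrong format. It is not the Nutch code and not the attached patch: the class `TruncationCheck` and the methods `isTruncated` and `truncationMessage` are hypothetical names, the Content-Length value is passed in as a plain string, and the sketch treats content as truncated only when fewer bytes arrived than were declared (the quoted snippet compares with `!=` instead).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical standalone sketch; names are illustrative, not Nutch API. */
final class TruncationCheck {

  private TruncationCheck() {
  }

  /**
   * Returns true if a Content-Length value is present, parseable, and
   * larger than the number of bytes actually fetched.
   */
  public static boolean isTruncated(String contentLengthHeader,
      byte[] contentInBytes) {
    if (contentLengthHeader == null) {
      return false; // no declared length, nothing to compare against
    }
    try {
      int declared = Integer.parseInt(contentLengthHeader.trim());
      return contentInBytes.length < declared;
    } catch (NumberFormatException e) {
      return false; // unparseable header: treat the length as unknown
    }
  }

  /** Builds a message that names the format this parser handles (zip). */
  public static String truncationMessage(int actualLength) {
    return "Content truncated at " + actualLength
        + " bytes. Parser can't handle incomplete zip file.";
  }

  public static void main(String[] args) {
    // Ten bytes fetched, but the (assumed) header declared 1024.
    byte[] fetched = "PK partial".getBytes(StandardCharsets.US_ASCII);
    if (isTruncated("1024", fetched)) {
      System.out.println(truncationMessage(fetched.length));
    }
  }
}
```

With a shared helper of this kind, a parser plugin would only decide what to do with truncated content and would not need its own copy of the length comparison or its own hand-written message.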