[ https://issues.apache.org/jira/browse/NUTCH-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney updated NUTCH-633:
--------------------------------

    Attachment: NUTCH_633.patch

OK, I shouldn't have missed this one :)

Anyway, I think it is better to modify the fetchers so that they always store FETCH_STATUS_KEY instead of modifying parser. And, here is a patch which does exactly that :D

> ParseSegment no longer allow reparsing
> --------------------------------------
>
>                 Key: NUTCH-633
>                 URL: https://issues.apache.org/jira/browse/NUTCH-633
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>         Environment: any
>            Reporter: Xue Yong Zhi
>            Priority: Minor
>         Attachments: NUTCH_633.patch
>
>
> ParseSegment used to allow reparsing even if parsing has been enabled in
> Fetcher. But now it throws a NumberFormatException as
> 'content.getMetadata().get(Nutch.FETCH_STATUS_KEY)' is null.
> This patch will fix the problem:
> --- a/src/java/org/apache/nutch/parse/ParseSegment.java
> +++ b/src/java/org/apache/nutch/parse/ParseSegment.java
> @@ -70,8 +70,10 @@ public class ParseSegment extends Configured implements Tool, Mapper<WritableCom
>        key = newKey;
>      }
>
> +    //status_key is only available when parsing is not done in fetcher
> +    String status_key = content.getMetadata().get(Nutch.FETCH_STATUS_KEY);
>      int status =
> -      Integer.parseInt(content.getMetadata().get(Nutch.FETCH_STATUS_KEY));
> +      (null == status_key) ? CrawlDatum.STATUS_FETCH_SUCCESS : Integer.parseInt(status_key);
>      if (status != CrawlDatum.STATUS_FETCH_SUCCESS) {
>        // content not fetched successfully, skip document
>        LOG.debug("Skipping " + key + " as content is not fetched successfully");
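
For readers without a Nutch checkout, the failure and the parser-side workaround from the quoted diff can be reproduced in isolation. This is only a rough sketch, not the attached NUTCH_633.patch (which changes the fetchers instead). The class name, the metadata map, the key string and the numeric status values are all placeholders made up for illustration; they are not taken from Nutch.

```java
import java.util.HashMap;
import java.util.Map;

public class FetchStatusSketch {
    // Placeholder standing in for CrawlDatum.STATUS_FETCH_SUCCESS (illustrative value only).
    public static final int STATUS_FETCH_SUCCESS = 1;

    // Placeholder standing in for Nutch.FETCH_STATUS_KEY (hypothetical key string).
    public static final String FETCH_STATUS_KEY = "fetch.status";

    // Old behaviour: Integer.parseInt(null) throws NumberFormatException
    // when the key was never stored in the metadata.
    public static int strictStatus(Map<String, String> metadata) {
        return Integer.parseInt(metadata.get(FETCH_STATUS_KEY));
    }

    // Workaround from the quoted diff: a missing key is treated as a successful fetch.
    public static int resolveStatus(Map<String, String> metadata) {
        String statusKey = metadata.get(FETCH_STATUS_KEY);
        return (statusKey == null) ? STATUS_FETCH_SUCCESS : Integer.parseInt(statusKey);
    }

    public static void main(String[] args) {
        Map<String, String> withoutKey = new HashMap<>(); // key never stored
        Map<String, String> withKey = new HashMap<>();
        withKey.put(FETCH_STATUS_KEY, "2");               // some non-success status

        try {
            strictStatus(withoutKey);
        } catch (NumberFormatException e) {
            System.out.println("strict lookup failed on missing key");
        }
        System.out.println(resolveStatus(withoutKey) == STATUS_FETCH_SUCCESS);
        System.out.println(resolveStatus(withKey) == STATUS_FETCH_SUCCESS);
    }
}
```

The null-tolerant lookup keeps reparsing working for segments that lack the key, at the cost of assuming such content was fetched successfully. Having the fetchers always store the key avoids that assumption.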