JIRA
Fri, 19 Sep 2008 06:19:10 -0700
[ https://issues.apache.org/jira/browse/NUTCH-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doğacan Güney updated NUTCH-633:
--------------------------------
Attachment: NUTCH_633.patch
OK, I shouldn't have missed this one :)
Anyway, I think it is better to modify the fetchers so that they always store
FETCH_STATUS_KEY, instead of modifying the parser.
And, here is a patch which does exactly that :D
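
For readers without the attachment, the idea of always storing the status on the fetcher side can be sketched roughly as below. This is only an illustration, not the contents of NUTCH_633.patch: a plain Map stands in for the content metadata, and the class name, key string, and "parsed" marker are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class AlwaysStoreStatusSketch {
    // Hypothetical stand-in for Nutch.FETCH_STATUS_KEY; the real key string is not shown here.
    static final String FETCH_STATUS_KEY = "example.fetch.status";

    // Builds the metadata a fetcher would attach to fetched content.
    // The status is written unconditionally, whether or not the fetcher also parses.
    static Map<String, String> buildMetadata(int fetchStatus, boolean parsingInFetcher) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(FETCH_STATUS_KEY, Integer.toString(fetchStatus));
        if (parsingInFetcher) {
            // Hypothetical marker only; it does not affect whether the status is stored.
            metadata.put("example.parsed", "true");
        }
        return metadata;
    }

    public static void main(String[] args) {
        // In both modes, a later reparse can read the status back without hitting null.
        System.out.println(buildMetadata(2, true).get(FETCH_STATUS_KEY));
        System.out.println(buildMetadata(2, false).get(FETCH_STATUS_KEY));
    }
}
```

With the status always present, the existing Integer.parseInt call in ParseSegment would not need a null check.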
> ParseSegment no longer allow reparsing
> --------------------------------------
>
> Key: NUTCH-633
> URL: https://issues.apache.org/jira/browse/NUTCH-633
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.0.0
> Environment: any
> Reporter: Xue Yong Zhi
> Priority: Minor
> Attachments: NUTCH_633.patch
>
>
> ParseSegment used to allow reparsing even when parsing had been enabled in
> the Fetcher. It now throws a NumberFormatException because
> 'content.getMetadata().get(Nutch.FETCH_STATUS_KEY)' is null.
> This patch will fix the problem:
> --- a/src/java/org/apache/nutch/parse/ParseSegment.java
> +++ b/src/java/org/apache/nutch/parse/ParseSegment.java
> @@ -70,8 +70,10 @@ public class ParseSegment extends Configured implements Tool, Mapper<WritableCom
> key = newKey;
> }
>
> + //status_key is only available when parsing is not done in fetcher
> + String status_key = content.getMetadata().get(Nutch.FETCH_STATUS_KEY);
> int status =
> - Integer.parseInt(content.getMetadata().get(Nutch.FETCH_STATUS_KEY));
> + (null == status_key) ? CrawlDatum.STATUS_FETCH_SUCCESS : Integer.parseInt(status_key);
> if (status != CrawlDatum.STATUS_FETCH_SUCCESS) {
> // content not fetched successfully, skip document
> LOG.debug("Skipping " + key + " as content is not fetched successfully");
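
The failure mode described in the report can be reproduced outside Nutch: Integer.parseInt throws NumberFormatException when handed null, and the patch above avoids this by defaulting to success. The following is a minimal standalone sketch, with a local constant standing in for CrawlDatum.STATUS_FETCH_SUCCESS (its value here is illustrative only) and a hypothetical class name.

```java
public class FetchStatusDemo {
    // Stand-in for CrawlDatum.STATUS_FETCH_SUCCESS; the value is illustrative only.
    static final int STATUS_FETCH_SUCCESS = 1;

    // Mirrors the guarded lookup in the quoted patch:
    // a missing status string is treated as a successful fetch.
    static int resolveStatus(String statusKey) {
        return (statusKey == null) ? STATUS_FETCH_SUCCESS : Integer.parseInt(statusKey);
    }

    // Returns true if parsing the raw value without a guard throws NumberFormatException.
    static boolean unguardedParseFails(String statusKey) {
        try {
            Integer.parseInt(statusKey);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded parse of null fails: " + unguardedParseFails(null));
        System.out.println("guarded status for null: " + resolveStatus(null));
        System.out.println("guarded status for \"2\": " + resolveStatus("2"));
    }
}
```

Note the trade-off in the guarded version: content whose status was never recorded is assumed to have been fetched successfully.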
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.