[jira] Commented: (NUTCH-596) ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS

Emmanuel Joke (JIRA) Fri, 04 Jan 2008 03:04:58 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555890#action_12555890
 ]


Emmanuel Joke commented on NUTCH-596:
-------------------------------------

I agree with you that the proper solution is the third one. However, I don't 
like the idea of spending time and processor cycles on records that we are not 
interested in.

What do you think?

> ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-596
>                 URL: https://issues.apache.org/jira/browse/NUTCH-596
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Emmanuel Joke
>
> There are two ways to parse the content: within the Fetcher class or 
> with the ParseSegment class.
> Fetcher (1 or 2) first checks whether the CrawlDatum status is 
> STATUS_FETCH_SUCCESS and parses the content only if it is.
> However, we don't have this check in ParseSegment, so we parse all content 
> stored on disk without checking the status.
> So I think we should implement this check. I can see only 3 solutions:
> - read the status code from the Metadata of the Content object
> - don't store content for a fetch whose CrawlDatum status is not STATUS_FETCH_SUCCESS
> - load the CrawlDatum object in ParseSegment
> What are your thoughts?
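
For illustration only, the first option listed above (reading the status 
code from the content metadata before parsing) might look roughly like the 
sketch below. It uses plain Java maps as stand-ins and does not touch the 
Nutch classes: the metadata key "fetch.status", the numeric status values, 
and the helper names shouldParse and urlsToParse are all assumptions made 
for this example, not the actual Nutch API. Skipping records that carry no 
status at all is likewise just one possible policy.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParseStatusCheckSketch {
    // Hypothetical stand-ins for the CrawlDatum status codes; the real
    // constant values in Nutch are not assumed here.
    static final int STATUS_FETCH_SUCCESS = 1;
    static final int STATUS_FETCH_GONE = 2;

    // Hypothetical metadata key under which the fetcher would record the status.
    static final String STATUS_KEY = "fetch.status";

    // Returns true only when the metadata carries a status equal to
    // STATUS_FETCH_SUCCESS. Missing or malformed values are skipped
    // (an assumed policy for this sketch).
    static boolean shouldParse(Map<String, String> metadata) {
        String raw = metadata.get(STATUS_KEY);
        if (raw == null) {
            return false;
        }
        try {
            return Integer.parseInt(raw.trim()) == STATUS_FETCH_SUCCESS;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Filters a segment (URL -> content metadata) down to the URLs worth
    // parsing, preserving the iteration order of the input map.
    static List<String> urlsToParse(Map<String, Map<String, String>> segment) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : segment.entrySet()) {
            if (shouldParse(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> segment = new LinkedHashMap<>();
        segment.put("http://a.example/ok",
                Map.of(STATUS_KEY, String.valueOf(STATUS_FETCH_SUCCESS)));
        segment.put("http://b.example/gone",
                Map.of(STATUS_KEY, String.valueOf(STATUS_FETCH_GONE)));
        segment.put("http://c.example/unknown", Map.of());

        // prints [http://a.example/ok]
        System.out.println(urlsToParse(segment));
    }
}
```

The check is cheap because it needs only the metadata already stored with 
each content record, so nothing extra has to be loaded before deciding to 
skip a record.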

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
