[ https://issues.apache.org/jira/browse/NUTCH-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555890#action_12555890 ]
Emmanuel Joke commented on NUTCH-596:
-------------------------------------

I agree with you that the proper solution is the third one. However, I don't like the idea of spending time and processor cycles on records that we are not interested in. What do you think?

> ParseSegments parse content even if its not CrawlDatum.STATUS_FETCH_SUCCESS
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-596
>                 URL: https://issues.apache.org/jira/browse/NUTCH-596
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>            Reporter: Emmanuel Joke
>
> We have 2 choices for parsing the content: either within the Fetcher class or
> with the ParseSegment class.
> Fetcher (1 or 2) first checks whether the CrawlDatum status is STATUS_FETCH_SUCCESS,
> and parses the content only if it is.
> However, we don't have this check in ParseSegment, so we parse all the content
> stored on disk without checking the status.
> So I think we should implement this check. I can see only 3 solutions:
> - read the status code from the Metadata of the Content object
> - don't store content for a fetch whose CrawlDatum status <> STATUS_FETCH_SUCCESS
> - load the CrawlDatum object in ParseSegment
> What are your thoughts?
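
To make the first option from the issue description concrete (carry the fetch status in the content metadata and check it before parsing), here is a minimal, self-contained sketch. It does not use the Nutch classes: the metadata is modelled as a plain Map, and the key name, the status values, and the method names are all hypothetical, chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class FetchStatusCheck {
    // Hypothetical key and status values for this sketch only;
    // they are not taken from the Nutch source.
    static final String FETCH_STATUS_KEY = "fetch.status";
    static final int STATUS_FETCH_SUCCESS = 1;
    static final int STATUS_FETCH_GONE = 2;

    /** Fetch side: record the fetch status alongside the stored content. */
    static void recordStatus(Map<String, String> contentMetadata, int status) {
        contentMetadata.put(FETCH_STATUS_KEY, Integer.toString(status));
    }

    /** Parse side: parse only when the recorded status is a successful fetch. */
    static boolean shouldParse(Map<String, String> contentMetadata) {
        String raw = contentMetadata.get(FETCH_STATUS_KEY);
        if (raw == null) {
            // Assumption: content with no recorded status is skipped.
            return false;
        }
        try {
            return Integer.parseInt(raw) == STATUS_FETCH_SUCCESS;
        } catch (NumberFormatException e) {
            // A malformed status is treated like a failed fetch.
            return false;
        }
    }

    public static void main(String[] args) {
        Map<String, String> fetched = new HashMap<>();
        recordStatus(fetched, STATUS_FETCH_SUCCESS);

        Map<String, String> gone = new HashMap<>();
        recordStatus(gone, STATUS_FETCH_GONE);

        System.out.println(shouldParse(fetched)); // prints "true"
        System.out.println(shouldParse(gone));    // prints "false"
    }
}
```

With this approach the parse step needs nothing beyond the content record itself, so it avoids loading a second object per record. The cost is that the fetch step has to write the status into the metadata for every record it stores.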