[
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704505#comment-13704505
]
Ferdy Galema commented on NUTCH-1457:
-------------------------------------
That seems like a nice solution, although there might be a problem with code
that assumes STATUS_FETCHED, for example the ParserJob: It only processes
STATUS_FETCHED entries. There may be more dependencies.
So maybe a small modification to your solution: Do not introduce a new crawl
status, but add a field to the Markers that indicates an item is scheduled for
a fetch. At update time this same field is checked to see if setFetchSchedule
should be called. (The page Metadata could also be used, but the markers should
be a bit faster because the map is generally smaller.)
I.e.:
// Generator:
page.putToMarkers(SCHEDULED, TRUE);
// Updater:
if (TRUE.equals(page.getFromMarkers(SCHEDULED))) {
  page.removeFromMarkers(SCHEDULED);
  setFetchSchedule..
} else {
  if over max then force etc..
}
SCHEDULED is a constant new Utf8("scheduled") and TRUE is new Utf8("true"). If
needed, a more meaningful value could be used instead of TRUE.
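
To make the marker idea concrete, here is a minimal self-contained sketch. It
does not use the real Nutch classes: Page is a hypothetical stand-in for
WebPage that models only the markers map, plain Strings replace Utf8, and a
counter stands in for the setFetchSchedule call. The "over max then force"
branch is only recorded with a flag, since its details are not spelled out
here.

```java
import java.util.HashMap;
import java.util.Map;

public class ScheduledMarkerSketch {
    // Assumed marker key and value; the comment proposes Utf8 constants instead.
    static final String SCHEDULED = "scheduled";
    static final String TRUE = "true";

    /** Hypothetical stand-in for WebPage: only the markers map is modelled. */
    static class Page {
        final Map<String, String> markers = new HashMap<>();
        int fetchScheduleCalls = 0;      // counts simulated setFetchSchedule calls
        boolean tookUnscheduledPath = false;
    }

    /** Generator side: flag the page as scheduled for a fetch. */
    static void generate(Page page) {
        page.markers.put(SCHEDULED, TRUE);
    }

    /** Updater side: apply the fetch schedule only if the marker is present. */
    static void update(Page page) {
        if (TRUE.equals(page.markers.get(SCHEDULED))) {
            // Remove the marker so a later update does not reschedule again.
            page.markers.remove(SCHEDULED);
            page.fetchScheduleCalls++;   // placeholder for setFetchSchedule
        } else {
            // Placeholder for the "if over max then force" handling.
            page.tookUnscheduledPath = true;
        }
    }

    public static void main(String[] args) {
        Page page = new Page();
        generate(page);
        update(page);   // marker present: schedule is applied once
        update(page);   // marker gone: falls through to the else branch
        System.out.println("setFetchSchedule calls: " + page.fetchScheduleCalls);
    }
}
```

Because the updater removes the marker, a second update on the same page takes
the else branch and leaves the schedule alone, which is the "processed only
once" behaviour this issue asks for.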
> Nutch2 Refactor the update process so that fetched items are only processed
> once
> --------------------------------------------------------------------------------
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java,
> GeneratorMapper.java, GeneratorReducer.java
>
>