[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13704505#comment-13704505
 ] 

Ferdy Galema commented on NUTCH-1457:
-------------------------------------

That seems like a nice solution, although there might be a problem with code 
that assumes STATUS_FETCHED. For example, the ParserJob only processes 
STATUS_FETCHED entries. There may be more dependencies.

So maybe a small modification to your solution: do not introduce a new crawl 
status, but add a field to the Markers that indicates an item is scheduled for 
a fetch. At update time this same field is checked to see if setFetchSchedule 
should be called. (The page Metadata could also be used, but the markers should 
be a bit faster because the map is generally smaller.)

I.e.:

// Generator:
page.putToMarkers(SCHEDULED, TRUE);

// Updater:
if (TRUE.equals(page.getFromMarkers(SCHEDULED))) {
  page.removeFromMarkers(SCHEDULED);
  setFetchSchedule..
} else {
  if over max then force etc..
}

Here SCHEDULED is a constant, new Utf8("scheduled"), and TRUE is new 
Utf8("true"). If needed, a more meaningful value could be used instead of TRUE.
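
To make the marker idea concrete, here is a minimal, self-contained sketch. It 
is not the actual Nutch code: the Page class is a hypothetical stand-in for 
WebPage that only holds a markers map, plain Strings replace Utf8 so nothing 
outside the JDK is needed, and the scheduleCalls counter and forced flag are 
placeholders for the real setFetchSchedule call and the "over max" handling.

```java
import java.util.HashMap;
import java.util.Map;

public class ScheduledMarkerSketch {

    // Assumed marker key and value; the proposal uses Utf8 constants instead.
    static final String SCHEDULED = "scheduled";
    static final String TRUE = "true";

    /** Hypothetical stand-in for WebPage: a markers map plus two observable fields. */
    static class Page {
        private final Map<String, String> markers = new HashMap<>();
        int scheduleCalls = 0;   // placeholder for calls to setFetchSchedule
        boolean forced = false;  // placeholder for the "over max then force" branch

        void putToMarkers(String key, String value) { markers.put(key, value); }
        String getFromMarkers(String key) { return markers.get(key); }
        String removeFromMarkers(String key) { return markers.remove(key); }
    }

    /** Generator side: mark the page as scheduled for a fetch. */
    static void generate(Page page) {
        page.putToMarkers(SCHEDULED, TRUE);
    }

    /** Updater side: apply the fetch schedule only if the marker is present. */
    static void update(Page page, boolean overMaxInterval) {
        if (TRUE.equals(page.getFromMarkers(SCHEDULED))) {
            // Removing the marker ensures a later update skips this branch.
            page.removeFromMarkers(SCHEDULED);
            page.scheduleCalls++;
        } else if (overMaxInterval) {
            page.forced = true;
        }
    }

    public static void main(String[] args) {
        Page page = new Page();
        generate(page);
        update(page, false);
        update(page, false); // marker is already gone, so nothing happens here
        System.out.println("scheduleCalls = " + page.scheduleCalls);
    }
}
```

Running update twice after a single generate touches the schedule only once, 
which is the "processed only once" behaviour this issue is after, without 
adding a new crawl status that ParserJob and others would need to know about.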

                
> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-1457
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1457
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: 2.4
>
>         Attachments: CrawlStatus.java, DbUpdateReducer.java, 
> GeneratorMapper.java, GeneratorReducer.java
>
>


