[
https://issues.apache.org/jira/browse/NUTCH-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15815806#comment-15815806
]
Sebastian Nagel commented on NUTCH-2334:
----------------------------------------
If it's only about deciding whether a page is (re)fetched or not, this is
possible using scoring filters:
- a page is only "generated" (added to the fetch list) if the value returned by
{{generatorSortValue}} is not below {{generate.min.score}}
- scoring filters can be stacked: the value returned by one filter is passed to
the next filter as an argument to {{generatorSortValue}}
In this respect you can think of a ScoringFilter as relevance-based scheduling,
while the FetchSchedule interface is time-based. For large crawls you need both
approaches because it's always about sampling: there are far more URLs than you
are able to fetch.
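
A minimal, self-contained sketch of the two points above (stacking and the
minimum-score cut-off). It does not use the Nutch classes: the
SortValueFilter interface, the two demo filters, the URLs and the threshold
of 0.5 are all assumptions made up for illustration, and the real
ScoringFilter interface has more methods and different argument types.

```java
import java.util.Arrays;
import java.util.List;

public class StackedScoringSketch {

    /** Hypothetical, simplified stand-in for the sort-value hook of a scoring filter. */
    interface SortValueFilter {
        float generatorSortValue(String url, float storedScore, float initSort);
    }

    /** Feeds the value returned by each filter into the next one, in order. */
    static float chain(List<SortValueFilter> filters, String url,
                       float storedScore, float initSort) {
        float sort = initSort;
        for (SortValueFilter filter : filters) {
            sort = filter.generatorSortValue(url, storedScore, sort);
        }
        return sort;
    }

    /** A page goes on the fetch list only if its sort value is not below the minimum. */
    static boolean isGenerated(float sortValue, float minScore) {
        return !(sortValue < minScore);
    }

    /** Two made-up filters: weight by stored score, then demote archive pages. */
    static List<SortValueFilter> demoFilters() {
        SortValueFilter byStoredScore = (url, score, sort) -> sort * score;
        SortValueFilter demoteArchive =
            (url, score, sort) -> url.contains("/archive/") ? sort * 0.5f : sort;
        return Arrays.asList(byStoredScore, demoteArchive);
    }

    public static void main(String[] args) {
        float minScore = 0.5f; // plays the role of generate.min.score (assumed value)
        String[] urls = {"http://example.org/news/a", "http://example.org/archive/b"};
        for (String url : urls) {
            float sort = chain(demoFilters(), url, 0.75f, 1.0f);
            System.out.println(url + " sort=" + sort
                + " generated=" + isGenerated(sort, minScore));
        }
    }
}
```

With these assumed values the news page keeps a sort value of 0.75 and is
generated, while the archive page drops to 0.375 and is skipped. The order of
the filters matters only when a filter does something other than scale its
input.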
> Extension point for schedulers
> ------------------------------
>
> Key: NUTCH-2334
> URL: https://issues.apache.org/jira/browse/NUTCH-2334
> Project: Nutch
> Issue Type: New Feature
> Components: generator
> Affects Versions: 1.12
> Reporter: Roannel Fernández Hernández
> Priority: Minor
> Fix For: 1.13
>
>
> With an extension point for schedulers, users should be able to create
> new schedulers that meet their own needs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)