[ https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702326#action_12702326 ]

Julien Nioche commented on NUTCH-477:
-------------------------------------

Having a scope for the URL filters could be useful in cases where we want to do 
a focused crawl. If, for instance, we want to parse only a limited number of 
domains, we could have one set of filters in ParseOutputFormat (so that we keep 
some of the outgoing links, using the usual prefix and suffix filters for 
instance) and another in CrawlDBFilter, so that we keep only the URLs matching 
our limited set of domains.
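
To make this concrete, here is a rough sketch of what a scoped call could look 
like, mirroring what URLNormalizers already does (see the issue description 
below). The two-argument URLFilters constructor and the scope names are 
hypothetical assumptions; only the plain filter(String) call exists today.

// Hypothetical sketch only: URLFilters does not take a scope today.
// The two-argument constructor and the scope names mirror the existing
// URLNormalizers API and are assumptions, not the actual patch.
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilters;

public class ScopedFilteringSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // In ParseOutputFormat: keep a reasonable subset of the outgoing links,
    // e.g. with the usual prefix/suffix rules.
    URLFilters outlinkFilters = new URLFilters(conf, "outlink");  // hypothetical scope

    // In CrawlDBFilter: keep only URLs from the limited set of domains.
    URLFilters crawldbFilters = new URLFilters(conf, "crawldb");  // hypothetical scope

    String url = "http://www.example.com/some/page.html";
    System.out.println("kept as outlink: " + (outlinkFilters.filter(url) != null));
    System.out.println("kept in crawldb: " + (crawldbFilters.filter(url) != null));
  }
}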

Another way of doing it would be to have a different set of filters for the 
Generation, so that we fetch only within the domains of interest but keep all 
URLs in the crawlDB. 

Of course we could use custom scorers to give a low score to the URLs we don't 
want to fetch and set a threshold in the Generation, but IMHO being able to do 
this with the filters would be more elegant.
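
On the return-code part of the proposal quoted below: I have not looked at how 
urlfilters.patch handles it, but here is a sketch of the idea. The codes and 
the filterCode() method are invented for illustration; the current URLFilter 
interface simply returns the String or null.

// Hypothetical sketch of an int-returning filter chain with early termination.
// Codes and method names are invented; this is not what the patch implements.
public class CodedFilterChainSketch {

  public static final int CODE_PASS = 0;     // neutral: keep going down the chain
  public static final int CODE_REJECT = -1;  // drop the URL, stop the chain now
  public static final int CODE_ACCEPT = 1;   // keep the URL, skip remaining filters

  // Hypothetical variant of URLFilter returning a code instead of a String.
  public interface CodedURLFilter {
    int filterCode(String url);
  }

  // Walk the chain and stop as soon as a filter gives a definitive answer.
  public static boolean accept(String url, CodedURLFilter[] chain) {
    for (CodedURLFilter f : chain) {
      int code = f.filterCode(url);
      if (code == CODE_REJECT) return false;  // early termination on reject
      if (code == CODE_ACCEPT) return true;   // early termination on accept
      // CODE_PASS: try the next filter
    }
    return true;                              // no filter objected
  }
}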

> Extend URLFilters to support different filtering chains
> -------------------------------------------------------
>
>                 Key: NUTCH-477
>                 URL: https://issues.apache.org/jira/browse/NUTCH-477
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.1
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: urlfilters.patch
>
>
> I propose to make the following changes to URLFilters:
> * extend URLFilters so that they support different filtering rules depending 
> on the context where they are executed. This functionality mirrors the one 
> that URLNormalizers already support.
> * change their return value to an int code, in order to support early 
> termination of long filtering chains.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.