[ 
https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-366:
---------------------------------------

    Fix Version/s: 2.2
                   1.7
    
> Merge URLFilters and URLNormalizers
> -----------------------------------
>
>                 Key: NUTCH-366
>                 URL: https://issues.apache.org/jira/browse/NUTCH-366
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki 
>              Labels: gsoc2012
>             Fix For: 1.7, 2.2
>
>
> Currently Nutch uses two subsystems related to URL validation and 
> normalization:
> * URLFilter: this interface checks if URLs are valid for further processing. 
> Input URL is not changed in any way. The output is a boolean value.
> * URLNormalizer: this interface brings URLs to their base ("normal") form, or 
> removes unneeded URL components, or performs any other URL mangling as 
> necessary. Input URLs are changed, and are returned as result.
> However, the various Nutch tools run filters and normalizers in a fixed 
> order: normalizers first, then filters. Where normalizers are complex and 
> costly to run (e.g. numerous regex rules, DNS lookups), it would make sense 
> to run some of the filters first (e.g. prefix-based filters that select only 
> certain protocols, or suffix-based filters that select only known 
> "extensions"). This is currently not possible: we always have to run the 
> normalizers, only to later throw away URLs because they failed to pass 
> through the filters.
> I would like to solicit comments on the following two solutions, and work on 
> implementation of one of them:
> 1) we could make URLFilters and URLNormalizers implement the same interface, 
> and basically make them interchangeable. This way users could configure their 
> order arbitrarily, even interleaving filters and normalizers. This is 
> more complicated, but gives much more flexibility, and NUTCH-365 already 
> provides a sufficient framework to implement this, including the ability to 
> define different sequences for different steps in the workflow.
> 2) we could use a property "url.mangling.order" ;) to define whether 
> normalizers or filters should run first. This is simple to implement, but 
> provides only limited improvement: either all filters or all normalizers 
> would run first, so they couldn't be mixed in arbitrary order.
> Any comments?
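
To make option 1 concrete, here is a minimal Java sketch of what a shared interface might look like. It is not Nutch code: UrlPipelineSketch, UrlProcessor, prefixFilter, stripFragment and run are all hypothetical names invented for illustration, and the fragment-stripping step merely stands in for an expensive normalizer. The assumption is that a step returns an empty Optional to reject a URL, so a filter and a normalizer can sit in the same ordered chain.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch only: none of these names exist in Nutch.
final class UrlPipelineSketch {

    /** Shared contract for filters and normalizers; empty means "reject". */
    interface UrlProcessor {
        Optional<String> process(String url);
    }

    /** Filter-style step: returns the URL unchanged, or rejects it. */
    static UrlProcessor prefixFilter(String... allowedPrefixes) {
        return url -> {
            for (String prefix : allowedPrefixes) {
                if (url.startsWith(prefix)) {
                    return Optional.of(url);
                }
            }
            return Optional.empty();
        };
    }

    /** Normalizer-style step: drops the fragment and counts its invocations. */
    static UrlProcessor stripFragment(AtomicInteger calls) {
        return url -> {
            calls.incrementAndGet();
            int hash = url.indexOf('#');
            return Optional.of(hash >= 0 ? url.substring(0, hash) : url);
        };
    }

    /** Runs the steps in their configured order, stopping at the first rejection. */
    static Optional<String> run(List<UrlProcessor> chain, String url) {
        String current = url;
        for (UrlProcessor step : chain) {
            Optional<String> result = step.process(current);
            if (!result.isPresent()) {
                return Optional.empty();
            }
            current = result.get();
        }
        return Optional.of(current);
    }

    public static void main(String[] args) {
        AtomicInteger normalizerCalls = new AtomicInteger();
        // Cheap filter first, (notionally) costly normalizer second.
        List<UrlProcessor> chain = Arrays.asList(
                prefixFilter("http://", "https://"),
                stripFragment(normalizerCalls));

        System.out.println(run(chain, "http://example.org/a#top"));
        System.out.println(run(chain, "ftp://example.org/file"));
        System.out.println("normalizer calls: " + normalizerCalls.get());
    }
}
```

With the filter placed first, the rejected ftp URL never reaches the normalizer. Under option 2 the same two steps could only run as "all normalizers, then all filters" or the reverse, which is the limitation described above.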

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
