[ https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-366:
----------------------------------
Fix Version/s: 1.8
> Merge URLFilters and URLNormalizers
> -----------------------------------
>
> Key: NUTCH-366
> URL: https://issues.apache.org/jira/browse/NUTCH-366
> Project: Nutch
> Issue Type: Improvement
> Reporter: Andrzej Bialecki
> Labels: gsoc2012
> Fix For: 2.3, 1.8
>
>
> Currently Nutch uses two subsystems related to URL validation and
> normalization:
> * URLFilter: this interface checks if URLs are valid for further processing.
> Input URL is not changed in any way. The output is a boolean value.
> * URLNormalizer: this interface brings URLs to their base ("normal") form, or
> removes unneeded URL components, or performs any other URL mangling as
> necessary. Input URLs are changed, and are returned as result.
> However, various Nutch tools run filters and normalizers in a predetermined
> order, i.e. normalizers first, and then filters. In some cases, where
> normalizers are complex and running them is costly (e.g. numerous regex
> rules, DNS lookups), it would make sense to run some of the filters first
> (e.g. prefix-based filters that select only certain protocols, or
> suffix-based filters that select only known "extensions"). This is currently
> not possible - we always have to run normalizers, only to later throw away
> URLs because they failed to pass through filters.
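The cost argument above can be illustrated with a minimal sketch. It does not use Nutch's actual URLFilter or URLNormalizer interfaces; the class `FilterFirstSketch`, the prefix filter `HTTP_ONLY`, and the fragment-stripping `COSTLY_NORMALIZER` are hypothetical stand-ins chosen only to count how often a (supposedly expensive) normalizer is invoked under each ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical illustration only; not part of Nutch. */
final class FilterFirstSketch {
    /** Number of times the stand-in normalizer has been invoked. */
    static int normalizerCalls = 0;

    /** Cheap prefix-based filter (assumption): accept only http and https URLs. */
    static final Predicate<String> HTTP_ONLY =
        url -> url.startsWith("http://") || url.startsWith("https://");

    /** Stand-in for a costly normalizer (assumption): strips the fragment and counts calls. */
    static final UnaryOperator<String> COSTLY_NORMALIZER = url -> {
        normalizerCalls++;
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    };

    /** The fixed order described in the issue: normalize everything, then filter. */
    static List<String> normalizeThenFilter(List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            String normalized = COSTLY_NORMALIZER.apply(url);
            if (HTTP_ONLY.test(normalized)) {
                out.add(normalized);
            }
        }
        return out;
    }

    /** The proposed alternative: run the cheap filter first, normalize only the survivors. */
    static List<String> filterThenNormalize(List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            if (HTTP_ONLY.test(url)) {
                out.add(COSTLY_NORMALIZER.apply(url));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://example.org/a#top",
            "ftp://example.org/b",
            "mailto:someone@example.org",
            "https://example.org/c");

        normalizerCalls = 0;
        List<String> first = normalizeThenFilter(urls);
        System.out.println(first + " normalizer calls: " + normalizerCalls);

        normalizerCalls = 0;
        List<String> second = filterThenNormalize(urls);
        System.out.println(second + " normalizer calls: " + normalizerCalls);
    }
}
```

With these four inputs, two of which pass the prefix filter, the normalizer runs four times in the first ordering and only twice in the second, while both orderings keep the same URLs. That equivalence holds here only because the stand-in normalizer never changes the scheme the filter inspects; a filter that depends on the normalized form would still have to run after normalization.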
> I would like to solicit comments on the following two solutions, and work on
> implementation of one of them:
> 1) We could make URLFilters and URLNormalizers implement the same interface,
> and basically make them interchangeable. This way users could configure their
> order arbitrarily, even interleaving filters and normalizers. This is
> more complicated, but gives much more flexibility - and NUTCH-365 already
> provides a sufficient framework to implement this, including the ability to
> define different sequences for different steps in the workflow.
> 2) We could use a property "url.mangling.order" ;) to define whether
> normalizers or filters should run first. This is simple to implement, but
> provides only limited improvement - either all filters or all normalizers
> would run first, so they couldn't be mixed in arbitrary order.
> Any comments?
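As a rough sketch of what option 1 might look like, the following assumes a single hypothetical interface, `UrlProcessor`, whose method returns the (possibly rewritten) URL or `null` to reject it. Neither `UrlProcessor`, `UrlProcessorChain`, nor the adapter methods exist in Nutch or in NUTCH-365; they are illustrative names only, and the example steps (a scheme filter, a fragment-stripping normalizer, a `.jpg` suffix filter) are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical common interface: return the (possibly rewritten) URL, or null to reject it. */
interface UrlProcessor {
    String process(String url);

    /** Adapts a boolean filter: the URL passes through unchanged or is rejected. */
    static UrlProcessor fromFilter(Predicate<String> accept) {
        return url -> accept.test(url) ? url : null;
    }

    /** Adapts a normalizer: the URL is rewritten and passed on. */
    static UrlProcessor fromNormalizer(UnaryOperator<String> rewrite) {
        return rewrite::apply;
    }
}

/** Runs the steps in the configured order and stops at the first rejection. */
final class UrlProcessorChain implements UrlProcessor {
    private final List<UrlProcessor> steps;

    UrlProcessorChain(UrlProcessor... steps) {
        this.steps = Arrays.asList(steps);
    }

    @Override
    public String process(String url) {
        String current = url;
        for (UrlProcessor step : steps) {
            current = step.process(current);
            if (current == null) {
                return null; // rejected: later (possibly costly) steps are skipped
            }
        }
        return current;
    }
}

final class UrlProcessorChainDemo {
    /** Example ordering (assumption): cheap filter, then normalizer, then a second filter. */
    static UrlProcessorChain exampleChain() {
        return new UrlProcessorChain(
            UrlProcessor.fromFilter(u -> u.startsWith("http://") || u.startsWith("https://")),
            UrlProcessor.fromNormalizer(u -> {
                int hash = u.indexOf('#');
                return hash < 0 ? u : u.substring(0, hash);
            }),
            UrlProcessor.fromFilter(u -> !u.endsWith(".jpg")));
    }

    public static void main(String[] args) {
        UrlProcessorChain chain = exampleChain();
        System.out.println(chain.process("http://example.org/page#top"));
        System.out.println(chain.process("ftp://example.org/file"));
        System.out.println(chain.process("http://example.org/pic.jpg#frag"));
    }
}
```

Under this sketch, option 2 would be the special case in which the chain is restricted to two fixed shapes: all filters followed by all normalizers, or the reverse. The third call in `main` shows why free interleaving can matter: the suffix filter only recognizes the `.jpg` URL after the normalizer has removed the fragment.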