[ https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-366:
---------------------------------------

    Fix Version/s: 2.2
                   1.7

> Merge URLFilters and URLNormalizers
> -----------------------------------
>
>                 Key: NUTCH-366
>                 URL: https://issues.apache.org/jira/browse/NUTCH-366
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Andrzej Bialecki
>              Labels: gsoc2012
>             Fix For: 1.7, 2.2
>
>
> Currently Nutch uses two subsystems related to url validation and
> normalization:
> * URLFilter: this interface checks if URLs are valid for further processing.
> Input URL is not changed in any way. The output is a boolean value.
> * URLNormalizer: this interface brings URLs to their base ("normal") form, or
> removes unneeded URL components, or performs any other URL mangling as
> necessary. Input URLs are changed, and are returned as result.
>
> However, various Nutch tools run filters and normalizers in pre-determined
> order, i.e. normalizers first, and then filters. In some cases, where
> normalizers are complex and running them is costly (e.g. numerous regex
> rules, DNS lookups) it would make sense to run some of the filters first
> (e.g. prefix-based filters that select only certain protocols, or
> suffix-based filters that select only known "extensions"). This is currently
> not possible - we always have to run normalizers, only to later throw away
> urls because they failed to pass through filters.
>
> I would like to solicit comments on the following two solutions, and work on
> implementation of one of them:
>
> 1) we could make URLFilters and URLNormalizers implement the same interface,
> and basically make them interchangeable. This way users could configure their
> order arbitrarily, even mixing filters and normalizers out of order. This is
> more complicated, but gives much more flexibility - and NUTCH-365 already
> provides sufficient framework to implement this, including the ability to
> define different sequences for different steps in the workflow.
>
> 2) we could use a property "url.mangling.order" ;) to define whether
> normalizers or filters should run first. This is simple to implement, but
> provides only limited improvement - because either all filters or all
> normalizers would run, they couldn't be mixed in arbitrary order.
>
> Any comments?
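
As a rough illustration of option 1 in the description, the sketch below puts filter-style and normalizer-style steps behind one interface so they can be ordered freely. It is not Nutch code: UrlChainSketch, UrlStep, filter(), normalizer(), run() and demoChain() are hypothetical names chosen for this example, and the rules in the demo chain are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

public class UrlChainSketch {

    /** Hypothetical unified step: returns the (possibly rewritten) URL, or empty to reject it. */
    public interface UrlStep {
        Optional<String> apply(String url);
    }

    /** Wraps a filter-style check: the URL is passed through unchanged or rejected. */
    public static UrlStep filter(Predicate<String> accept) {
        return url -> accept.test(url) ? Optional.of(url) : Optional.empty();
    }

    /** Wraps a normalizer-style rewrite: the URL is changed and returned. */
    public static UrlStep normalizer(Function<String, String> rewrite) {
        return url -> Optional.ofNullable(rewrite.apply(url));
    }

    /** Runs the steps in the configured order and stops at the first rejection. */
    public static Optional<String> run(List<UrlStep> chain, String url) {
        Optional<String> current = Optional.of(url);
        for (UrlStep step : chain) {
            if (!current.isPresent()) {
                break; // rejected earlier, so later (possibly costly) steps are skipped
            }
            current = step.apply(current.get());
        }
        return current;
    }

    /** Example ordering: cheap prefix filter, then a normalizer, then a suffix filter. */
    public static List<UrlStep> demoChain() {
        return Arrays.asList(
            filter(u -> u.startsWith("http://") || u.startsWith("https://")),
            normalizer(u -> u.replaceAll("#.*$", "")), // strip the fragment
            filter(u -> !u.endsWith(".exe")));
    }

    public static void main(String[] args) {
        System.out.println(run(demoChain(), "http://example.com/page#top"));
        System.out.println(run(demoChain(), "ftp://example.com/file"));
    }
}
```

In this sketch the ftp:// URL is rejected by the first filter, so the normalizer never runs on it. Option 2 would correspond to the restricted case where the chain is either all filters followed by all normalizers, or the reverse.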