[jira] Commented: (NUTCH-365) Flexible URL normalization

Andrzej Bialecki (JIRA) Mon, 11 Sep 2006 09:38:05 -0700

    [ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433890 ]

Andrzej Bialecki  commented on NUTCH-365:
-----------------------------------------


Running several iterations of filters/normalizers may be risky. We would 
have to ensure that the match/replace expressions are stable (idempotent), in 
the sense that running the same url two or more times through the same 
match/replace pair still produces the same result.

Example: if I want to always remove one level of domains (e.g. www.example.com 
-> example.com; foo.bar.baz.com -> bar.baz.com), running these filters again 
would produce unwanted results: a second pass would turn example.com into com.
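
A minimal sketch of the stability problem, in Java. The class and method names 
(StabilityCheck, stripOneLevel, stripWww, isStable) and both regular 
expressions are made up for illustration and are not part of the patch. The 
first rule removes the leftmost host label and so changes its own output on a 
second pass; the second rule removes only a leading "www." and is stable for 
the input shown.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

final class StabilityCheck {
    // Hypothetical rule: drop the leftmost host label ("remove one level of domains").
    private static final Pattern FIRST_LABEL = Pattern.compile("^(https?://)[^./]+\\.");

    // Hypothetical rule: drop only a leading "www." label.
    private static final Pattern WWW_LABEL = Pattern.compile("^(https?://)www\\.");

    static String stripOneLevel(String url) {
        return FIRST_LABEL.matcher(url).replaceFirst("$1");
    }

    static String stripWww(String url) {
        return WWW_LABEL.matcher(url).replaceFirst("$1");
    }

    // A rule is stable for a given url if applying it a second time changes nothing.
    static boolean isStable(UnaryOperator<String> rule, String url) {
        String once = rule.apply(url);
        return once.equals(rule.apply(once));
    }
}
```

With these definitions, isStable(StabilityCheck::stripOneLevel, 
"http://www.example.com/") is false, because the second pass yields 
"http://com/", while the same check with stripWww is true. A check like this 
only covers the urls it is given; it does not prove a rule stable in general.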

Re: short-circuiting the evaluation loops: we would have to change the way we 
pass arguments, so that a step can either change a url or leave it unchanged, 
and the loop can still proceed if needed. This seems to be the key semantic 
difference between filters and normalizers. Filters are primarily in the 
business of discarding urls, while normalizers only munge them and rarely 
cause them to be thrown away.
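
One possible shape for such a loop, as a sketch only. It assumes a single step 
type that returns the (possibly rewritten) url, or null to discard it; 
MungeChain, TRIM and HTTP_ONLY are hypothetical names, not classes from the 
patch.

```java
import java.util.List;
import java.util.function.UnaryOperator;

final class MungeChain {
    // Normalizer-like step (assumed for illustration): rewrites, never discards.
    static final UnaryOperator<String> TRIM = String::trim;

    // Filter-like step (assumed for illustration): keeps the url unchanged or discards it.
    static final UnaryOperator<String> HTTP_ONLY =
            url -> url.startsWith("http://") ? url : null;

    // Runs the steps in order; stops as soon as one step discards the url.
    static String run(List<UnaryOperator<String>> steps, String url) {
        String current = url;
        for (UnaryOperator<String> step : steps) {
            current = step.apply(current);
            if (current == null) {
                return null; // short-circuit: the url was thrown away
            }
        }
        return current;
    }
}
```

Under this convention a filter is just a step that returns either its input or 
null, and a normalizer is a step that never returns null.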

Re: per-site rules: you can already accomplish this. Just write a normalizer or 
filter which applies different rule-sets depending on the domain/host name.
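
A rough sketch of that idea in Java. PerHostNormalizer and addHostRules are 
hypothetical names and this does not implement the Nutch plugin interface; it 
only shows the dispatch on host name, with a default rule set for hosts that 
have no entry.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

final class PerHostNormalizer {
    private final Map<String, UnaryOperator<String>> rulesByHost = new HashMap<>();
    private final UnaryOperator<String> defaultRules;

    PerHostNormalizer(UnaryOperator<String> defaultRules) {
        this.defaultRules = defaultRules;
    }

    // Registers a rule set for one host; host names are compared case-insensitively.
    PerHostNormalizer addHostRules(String host, UnaryOperator<String> rules) {
        rulesByHost.put(host.toLowerCase(Locale.ROOT), rules);
        return this;
    }

    // Applies the rule set registered for the url's host, or the default rules.
    String normalize(String url) {
        String host = null;
        try {
            host = URI.create(url).getHost(); // may be null for some urls
        } catch (IllegalArgumentException e) {
            // Unparseable url: fall through to the default rules.
        }
        UnaryOperator<String> rules = (host == null)
                ? defaultRules
                : rulesByHost.getOrDefault(host.toLowerCase(Locale.ROOT), defaultRules);
        return rules.apply(url);
    }
}
```

For example, registering a rule for "example.com" that strips the query string 
would rewrite http://example.com/a?x=1 to http://example.com/a and leave urls 
on other hosts to the default rules.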

> Flexible URL normalization
> --------------------------
>
>                 Key: NUTCH-365
>                 URL: http://issues.apache.org/jira/browse/NUTCH-365
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>         Attachments: patch.txt
>
>
> This patch is a heavily restructured version of the patch in NUTCH-253, so 
> much so that I decided to create a separate issue. It changes the URL 
> normalization from a selectable single class to a flexible and context-aware 
> chain of normalization filters.
> Highlights:
> * rename all *UrlNormalizer* to *URLNormalizer* for consistency.
> * use a "chained filter" pattern for running several normalizers in sequence
> * the order in which normalizers are executed is defined by 
> "urlnormalizer.order" property, which lists space-separated implementation 
> classes. If there are more normalizers active than explicitly named on this 
> list, they will be run in random order after the ones specified on the list 
> are executed.
> * define a set of contexts (or scopes) in which normalizers may be called. 
> Each scope can have its own list of normalizers (via 
> "urlnormalizer.scope.<scope_name>" property) and its own order (via 
> "urlnormalizer.order.<scope_name>" property). If any of these properties are 
> missing, default settings are used.
> * each normalizer may further select among many configurations, depending on 
> the context in which it is called, using a modified API:
>    URLNormalizer.normalize(String url, String scope);
> * if a config for a given scope is not defined, then the default config will 
> be used.
> * several standard contexts / scopes have been defined, and various 
> applications have been modified to attempt to use the appropriate normalizer 
> in their context.
> * all JUnit tests have been modified, and run successfully.
> NUTCH-363 suggests to me that further changes may be required in this area; 
> perhaps we should combine urlfilters and urlnormalizers into a single 
> subsystem of url munging - now that we have support for scopes and flexible 
> combinations of normalizers we could turn URLFilters into a special case of 
> normalizers (or vice versa, depending on the point of view) ... 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
