[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435185 ]
            
Andrzej Bialecki  commented on NUTCH-365:
-----------------------------------------

Lowercasing is done here because we can't rely on each normalizer to do it, and 
having uniform host names is important here.
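
A minimal sketch of what such a centralized lowercasing step could look like, for illustration only. The class and method names (HostLowercaser, lowercaseHost) are hypothetical and not taken from the patch; the sketch assumes that only the scheme and host are lowercased, and that the path, query and fragment must keep their case.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of the NUTCH-365 patch.
final class HostLowercaser {

    /**
     * Returns the URL with its scheme and host lowercased.
     * Path, query and fragment are copied through unchanged.
     * Input that cannot be parsed, or that has no scheme or host,
     * is returned as-is.
     */
    static String lowercaseHost(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url; // assumption: leave unparseable input untouched
        }
        if (uri.getScheme() == null || uri.getHost() == null) {
            return url;
        }
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme().toLowerCase(Locale.ROOT)).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(uri.getHost().toLowerCase(Locale.ROOT));
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
```

Doing this once, before or after the chain runs, means individual normalizers do not each have to repeat it.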

> Flexible URL normalization
> --------------------------
>
>                 Key: NUTCH-365
>                 URL: http://issues.apache.org/jira/browse/NUTCH-365
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>         Attachments: patch.txt
>
>
> This patch is a heavily restructured version of the patch in NUTCH-253, so 
> much so that I decided to create a separate issue. It changes URL 
> normalization from a single selectable class to a flexible, context-aware 
> chain of normalization filters.
> Highlights:
> * rename all *UrlNormalizer* to *URLNormalizer* for consistency.
> * use a "chained filter" pattern for running several normalizers in sequence
> * the order in which normalizers are executed is defined by the 
> "urlnormalizer.order" property, which lists space-separated implementation 
> classes. If more normalizers are active than are explicitly named on this 
> list, the remaining ones will be run in random order after the ones 
> specified on the list are executed.
> * define a set of contexts (or scopes) in which normalizers may be called. 
> Each scope can have its own list of normalizers (via 
> "urlnormalizer.scope.<scope_name>" property) and its own order (via 
> "urlnormalizer.order.<scope_name>" property). If any of these properties is 
> missing, the default settings are used.
> * each normalizer may further select among many configurations, depending on 
> the context in which it is called, using a modified API:
>    URLNormalizer.normalize(String url, String scope);
> * if a config for a given scope is not defined, then the default config will 
> be used.
> * several standard contexts / scopes have been defined, and various 
> applications have been modified to attempt to use the appropriate 
> normalizers in their context.
> * all JUnit tests have been modified, and run successfully.
> NUTCH-363 suggests to me that further changes may be required in this area. 
> Perhaps we should combine urlfilters and urlnormalizers into a single 
> subsystem of URL munging: now that we have support for scopes and flexible 
> combinations of normalizers, we could turn URLFilters into a special case of 
> normalizers (or vice versa, depending on the point of view) ... 
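
The chained, scope-aware ordering described in the highlights above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from patch.txt: the Normalizer interface is a simplified stand-in that only mirrors the normalize(String url, String scope) signature, NormalizerChain and orderFor are hypothetical names, and configuration is modelled as a plain Map rather than the real Nutch configuration object. The per-scope "urlnormalizer.scope.<scope_name>" list is left out for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for the real interface (assumption).
interface Normalizer {
    String normalize(String url, String scope);
}

// Hypothetical chain: runs every active normalizer in the configured order.
final class NormalizerChain {
    private final Map<String, Normalizer> active; // keyed by implementation name
    private final Map<String, String> props;      // stand-in for the configuration

    NormalizerChain(Map<String, Normalizer> active, Map<String, String> props) {
        this.active = new LinkedHashMap<>(active);
        this.props = props;
    }

    /** Resolves the execution order for a scope, falling back to the default order. */
    List<String> orderFor(String scope) {
        String order = props.get("urlnormalizer.order." + scope);
        if (order == null) {
            order = props.getOrDefault("urlnormalizer.order", "");
        }
        List<String> result = new ArrayList<>();
        // Explicitly named normalizers come first, in the listed order.
        for (String name : order.trim().split("\\s+")) {
            if (active.containsKey(name) && !result.contains(name)) {
                result.add(name);
            }
        }
        // Active normalizers not named on the list run afterwards,
        // in no guaranteed order (here: the map's iteration order).
        for (String name : active.keySet()) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    /** Feeds the output of each normalizer into the next one. */
    String normalize(String url, String scope) {
        String current = url;
        for (String name : orderFor(scope)) {
            current = active.get(name).normalize(current, scope);
        }
        return current;
    }
}
```

With three active normalizers A, B and C, "urlnormalizer.order" set to "B A" and "urlnormalizer.order.fetcher" set to "C", this sketch runs B, A, C when no scope-specific order exists and C, A, B for the hypothetical "fetcher" scope.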


        
