[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435185 ]

Andrzej Bialecki commented on NUTCH-365:
-----------------------------------------

Lowercasing is done here because we can't rely on each normalizer to do it, and having uniform host names is important here.

> Flexible URL normalization
> --------------------------
>
>                 Key: NUTCH-365
>                 URL: http://issues.apache.org/jira/browse/NUTCH-365
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>             Fix For: 0.9.0
>
>         Attachments: patch.txt
>
>
> This patch is a heavily restructured version of the patch in NUTCH-253, so
> much that I decided to create a separate issue. It changes the URL
> normalization from a selectable single class to a flexible and context-aware
> chain of normalization filters.
> Highlights:
> * rename all *UrlNormalizer* to *URLNormalizer* for consistency.
> * use a "chained filter" pattern for running several normalizers in sequence
> * the order in which normalizers are executed is defined by
> "urlnormalizer.order" property, which lists space-separated implementation
> classes. If there are more normalizers active than explicitly named on this
> list, they will be run in random order after the ones specified on the list
> are executed.
> * define a set of contexts (or scopes) in which normalizers may be called.
> Each scope can have its own list of normalizers (via
> "urlnormalizer.scope.<scope_name>" property) and its own order (via
> "urlnormalizer.order.<scope_name>" property). If any of these properties are
> missing, default settings are used.
> * each normalizer may further select among many configurations, depending on
> the context in which it is called, using a modified API:
> URLNormalizer.normalize(String url, String scope);
> * if a config for a given scope is not defined, then the default config will
> be used.
> * several standard contexts / scopes have been defined, and various
> applications have been modified to attempt using appropriate normalizer in
> their context.
> * all JUnit tests have been modified, and run successfully.
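
For readers following the highlights above, here is a rough Java sketch of how a scoped normalizer chain of this kind could be wired up. It is not the code in patch.txt. Only the property names and the normalize(String url, String scope) signature come from the description; the class name NormalizerChainSketch, the short normalizer names "fragment" and "query" (the real property lists implementation classes), the scope names "default", "inject" and "fetcher", and the choice to lowercase the host once after the chain has run are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class NormalizerChainSketch {

    /** Same shape as the modified API in the description. */
    interface URLNormalizer {
        String normalize(String url, String scope);
    }

    private final Map<String, URLNormalizer> registry; // all active normalizers, by (hypothetical) short name
    private final Map<String, String> conf;            // stand-in for the real configuration object

    NormalizerChainSketch(Map<String, URLNormalizer> registry, Map<String, String> conf) {
        this.registry = registry;
        this.conf = conf;
    }

    /** Builds the chain for a scope: listed normalizers first, the rest afterwards. */
    List<URLNormalizer> chainFor(String scope) {
        // "urlnormalizer.scope.<scope>" restricts which normalizers run; default is all of them.
        Map<String, URLNormalizer> remaining = new LinkedHashMap<>();
        String names = conf.get("urlnormalizer.scope." + scope);
        if (names == null) {
            remaining.putAll(registry);
        } else {
            for (String name : names.trim().split("\\s+")) {
                if (registry.containsKey(name)) remaining.put(name, registry.get(name));
            }
        }
        // "urlnormalizer.order.<scope>" falls back to "urlnormalizer.order".
        String order = conf.get("urlnormalizer.order." + scope);
        if (order == null) order = conf.getOrDefault("urlnormalizer.order", "");
        List<URLNormalizer> chain = new ArrayList<>();
        for (String name : order.trim().split("\\s+")) {
            URLNormalizer n = remaining.remove(name);
            if (n != null) chain.add(n);
        }
        chain.addAll(remaining.values()); // unnamed normalizers run last, in no guaranteed order
        return chain;
    }

    /** Runs the chain, then lowercases the host so results are uniform (assumed placement). */
    String normalize(String url, String scope) {
        String result = url;
        for (URLNormalizer n : chainFor(scope)) {
            result = n.normalize(result, scope);
        }
        return lowercaseHost(result);
    }

    /** Lowercases only the host[:port] part of scheme://[userinfo@]host[:port]/... */
    static String lowercaseHost(String url) {
        int start = url.indexOf("://");
        if (start < 0) return url;
        start += 3;
        int end = url.length();
        for (int i = start; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') { end = i; break; }
        }
        int at = url.lastIndexOf('@', end - 1);
        if (at >= start) start = at + 1; // leave userinfo untouched
        return url.substring(0, start)
                + url.substring(start, end).toLowerCase(Locale.ROOT)
                + url.substring(end);
    }

    /** Example wiring with two hypothetical normalizers. */
    static NormalizerChainSketch example() {
        Map<String, URLNormalizer> registry = new LinkedHashMap<>();
        // Strips "#fragment" in every scope.
        registry.put("fragment", (url, scope) -> {
            int i = url.indexOf('#');
            return i < 0 ? url : url.substring(0, i);
        });
        // Picks its behaviour by scope: strips "?query" only when injecting.
        registry.put("query", (url, scope) -> {
            int i = url.indexOf('?');
            return ("inject".equals(scope) && i >= 0) ? url.substring(0, i) : url;
        });
        Map<String, String> conf = new HashMap<>();
        conf.put("urlnormalizer.order", "fragment query");
        conf.put("urlnormalizer.scope.fetcher", "fragment");
        return new NormalizerChainSketch(registry, conf);
    }

    public static void main(String[] args) {
        NormalizerChainSketch chain = example();
        System.out.println(chain.normalize("http://Example.COM/A?x=1#top", "default"));
        System.out.println(chain.normalize("http://Example.COM/A?x=1#top", "inject"));
    }
}
```

With this wiring the "default" scope yields http://example.com/A?x=1 and the "inject" scope yields http://example.com/A. Doing the host lowercasing in the chain runner rather than in each normalizer is one way to get uniform host names without relying on every plugin to do it.
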
> NUTCH-363 suggests to me that further changes may be required in this area,
> perhaps we should combine urlfilters and urlnormalizers into a single
> subsystem of url munging - now that we have support for scopes and flexible
> combinations of normalizers we could turn URLFilters into a special case of
> normalizers (or vice versa, depending on the point of view) ...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers