[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435449 ]

Doug Cook commented on NUTCH-365:
---------------------------------

It still seems to me that iterative normalization is useful and not risky. By 
definition, a "normalizer" is something which converts a URL to a "normal" 
form, and a URL in "normal" form should transform to itself. Thus a true 
"normalizer" should be stable. But I can see people wanting to do other 
transformations with normalizers, ones which perhaps shouldn't iterate. That's 
why there should be a configurable limit to the number of iterations, and those 
who want the current behavior can just set the limit to 1. Right now there is 
no good way, for example, to handle URLs with multiple session ID strings 
(rare, but extant!). Yes, one could manually repeat the pattern several times 
in the normalizer configuration, but this is hardly efficient. The second 
iteration of the same pattern should not be executed unless the first one 
matches.
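
To make the proposal concrete, here is a minimal sketch of iterating a 
normalization pass up to a configurable limit. It is not taken from the patch: 
the class name, the maxIterations parameter, and the session ID rule (which 
removes only one ";jsessionid=..." segment per pass) are all assumptions made 
for illustration.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

public class IterativeNormalizer {

    // Hypothetical rule: strips only the FIRST ";jsessionid=..." segment per pass.
    private static final Pattern SESSION_ID = Pattern.compile(";jsessionid=[^;?]*");
    static final UnaryOperator<String> STRIP_ONE_SESSION_ID =
            url -> SESSION_ID.matcher(url).replaceFirst("");

    /**
     * Applies one normalization pass repeatedly until the URL stops changing
     * (it has reached its "normal" form) or maxIterations passes have run.
     * A limit of 1 gives single-pass behavior.
     */
    static String normalize(String url, UnaryOperator<String> pass, int maxIterations) {
        String current = url;
        for (int i = 0; i < maxIterations; i++) {
            String next = pass.apply(current);
            if (next.equals(current)) {
                break; // stable: further passes would change nothing
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        String url = "http://foo.com/a;jsessionid=1;jsessionid=2?x=1";
        // Limit 1: only the first session ID is removed.
        System.out.println(normalize(url, STRIP_ONE_SESSION_ID, 1));
        // Limit 5: passes repeat until the URL is stable.
        System.out.println(normalize(url, STRIP_ONE_SESSION_ID, 5));
    }
}
```

With a limit of 1 the second session ID survives; with a higher limit the loop 
stops as soon as a pass leaves the URL unchanged, so a stable normalizer costs 
at most one extra pass.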

Re: your comment about site-specific normalization, is there already some way 
to do this efficiently? By "efficiently," I mean having a pattern which applies 
only to site foo.com and is not examined for other sites. I know I can already 
(and do already) add general regexps which will only match for foo.com -- but 
these will be executed for all URLs, even if they only match for foo.com, and 
thus slow things down quite a bit if there are many of them. I was thinking 
something like having a hash table of sites with site-specific patterns, and 
then executing the given normalizations only for the given sites. That would 
allow us to efficiently build large tables of mirrors and other site-specific 
normalizations (for example, for session ID removals which would be unsafe in 
the general case). Thoughts? If there is already some easy way to do this you 
will make me a happy man!
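
A rough sketch of the hash table idea, in case it helps the discussion. 
Nothing here is existing Nutch API: HostScopedNormalizer, Rule, and addRules 
are hypothetical names, and the jsessionid pattern is just an example of a 
rule that might be safe for one site only.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

public class HostScopedNormalizer {

    /** One regex substitution (hypothetical helper type). */
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }

        String apply(String url) {
            return pattern.matcher(url).replaceAll(replacement);
        }
    }

    // Host name -> rules that are only ever evaluated for that host.
    private final Map<String, List<Rule>> rulesByHost = new HashMap<>();

    void addRules(String host, List<Rule> rules) {
        rulesByHost.put(host.toLowerCase(Locale.ROOT), new ArrayList<>(rules));
    }

    String normalize(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return url; // unparsable URL: leave it untouched
        }
        if (host == null) {
            return url;
        }
        // Single hash lookup; URLs for other hosts never touch these regexps.
        List<Rule> rules = rulesByHost.get(host.toLowerCase(Locale.ROOT));
        if (rules == null) {
            return url;
        }
        String result = url;
        for (Rule rule : rules) {
            result = rule.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        HostScopedNormalizer n = new HostScopedNormalizer();
        List<Rule> fooRules = new ArrayList<>();
        fooRules.add(new Rule(";jsessionid=[^?]*", ""));
        n.addRules("foo.com", fooRules);
        System.out.println(n.normalize("http://foo.com/a;jsessionid=123?x=1"));
        System.out.println(n.normalize("http://bar.com/a;jsessionid=123?x=1"));
    }
}
```

The cost per URL is one host lookup plus the rules registered for that host, 
so the table of site-specific patterns can grow large without slowing down 
URLs from unrelated sites.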

> Flexible URL normalization
> --------------------------
>
>                 Key: NUTCH-365
>                 URL: http://issues.apache.org/jira/browse/NUTCH-365
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>         Attachments: patch.txt
>
>
> This patch is a heavily restructured version of the patch in NUTCH-253, so 
> much that I decided to create a separate issue. It changes the URL 
> normalization from a selectable single class to a flexible and context-aware 
> chain of normalization filters.
> Highlights:
> * rename all *UrlNormalizer* to *URLNormalizer* for consistency.
> * use a "chained filter" pattern for running several normalizers in sequence
> * the order in which normalizers are executed is defined by 
> "urlnormalizer.order" property, which lists space-separated implementation 
> classes. If there are more normalizers active than explicitly named on this 
> list, they will be run in random order after the ones specified on the list 
> are executed.
> * define a set of contexts (or scopes) in which normalizers may be called. 
> Each scope can have its own list of normalizers (via 
> "urlnormalizer.scope.<scope_name>" property) and its own order (via 
> "urlnormalizer.order.<scope_name>" property). If any of these properties are 
> missing, default settings are used.
> * each normalizer may further select among many configurations, depending on 
> the context in which it is called, using a modified API:
>    URLNormalizer.normalize(String url, String scope);
> * if a config for a given scope is not defined, then the default config will 
> be used.
> * several standard contexts / scopes have been defined, and various 
> applications have been modified to attempt using appropriate normalizer in 
> their context.
> * all JUnit tests have been modified, and run successfully.
> NUTCH-363 suggests to me that further changes may be required in this area, 
> perhaps we should combine urlfilters and urlnormalizers into a single 
> subsystem of url munging - now that we have support for scopes and flexible 
> combinations of normalizers we could turn URLFilters into a special case of 
> normalizers (or vice versa, depending on the point of view) ... 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers