Normalize Host during Generate
------------------------------

         Key: NUTCH-253
         URL: http://issues.apache.org/jira/browse/NUTCH-253
     Project: Nutch
        Type: New Feature

  Components: fetcher  
    Versions: 0.8-dev    
    Reporter: Rod Taylor


Extend the URL Normalizer to allow normalization of the hostname during Generate. 
By default, no rules are applied.

In short, this allows foo.bar.com, bif.baz.bar.com and bar.com to be counted as 
the same host for generate.max.per.host if an appropriate regex is used.
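
To illustrate the idea, here is a minimal standalone sketch in plain Java. It 
is not the Nutch plugin code: the class name, the method names and the bar.com 
rule are all hypothetical, chosen only to show how a regex rule can collapse 
several hostnames into one key before per-host counts are taken.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from Nutch itself.
public class HostNormalizerSketch {

    // Assumed example rule: bar.com and any subdomain of it map to "bar.com".
    private static final Pattern RULE = Pattern.compile("^(?:.*\\.)?bar\\.com$");
    private static final String REPLACEMENT = "bar.com";

    // Apply the rule to a hostname; hosts that do not match are returned
    // unchanged (apart from lowercasing).
    static String normalizeHost(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        return RULE.matcher(lower).replaceAll(REPLACEMENT);
    }

    // Count entries per normalized host, the way a per-host limit would
    // see them once normalization has been applied.
    static Map<String, Integer> countPerHost(List<String> hosts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String host : hosts) {
            counts.merge(normalizeHost(host), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList(
            "foo.bar.com", "bif.baz.bar.com", "bar.com", "example.org");
        // Prints {bar.com=3, example.org=1}
        System.out.println(countPerHost(hosts));
    }
}
```

With this rule in place, all three bar.com hostnames count toward a single 
limit, while example.org is still counted on its own.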

Add "urlnormalizer-regex" to plugin.includes in nutch-site.xml in order to 
enable it.

Since several modules now extend the urlnormalizer base, a "scope" parameter 
in plugin.xml differentiates between the various urlnormalizer modules so 
that the right one is selected for Generate.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers