Normalize Host during Generate
------------------------------
Key: NUTCH-253
URL: http://issues.apache.org/jira/browse/NUTCH-253
Project: Nutch
Type: New Feature
Components: fetcher
Versions: 0.8-dev
Reporter: Rod Taylor
Extend the URL Normalizer to allow normalization of the hostname during Generate.
By default no rules are applied.
In short, this allows foo.bar.com, bif.baz.bar.com and bar.com to be counted as
the same host for generate.max.per.host if an appropriate regex is used.
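
As a rough illustration of the idea (not the patch itself), the sketch below
collapses the three example hosts onto one key before counting them. The class
name, the method names and the regex are all assumptions made for this example;
the real rule would come from the regex configured for the normalizer.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: none of these names come from Nutch.
class HostNormalizeSketch {
    // Assumed rule: strip any subdomain labels in front of "bar.com".
    private static final Pattern RULE =
        Pattern.compile("^(?:[^.]+\\.)*bar\\.com$");
    private static final String REPLACEMENT = "bar.com";

    // Lower-case the host and apply the rule; hosts that do not match
    // the rule are returned unchanged (apart from lower-casing).
    static String normalizeHost(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        return RULE.matcher(lower).replaceAll(REPLACEMENT);
    }

    // Count entries per normalized host, as a per-host limit would.
    static Map<String, Integer> countPerHost(List<String> hosts) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String h : hosts) {
            counts.merge(normalizeHost(h), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {bar.com=3, example.org=1}
        System.out.println(countPerHost(List.of(
            "foo.bar.com", "bif.baz.bar.com", "bar.com", "example.org")));
    }
}
```

With the count keyed on the normalized host, all three bar.com hosts draw from
one generate.max.per.host budget instead of three.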
Add "urlnormalizer-regex" to plugin.includes in nutch-site.xml in order to
enable it.
Since several modules now extend the urlnormalizer base, a "scope" parameter
in plugin.xml differentiates between the various urlnormalizer modules so that
the right one is selected for Generate.
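
The scope idea can be pictured as a lookup from a scope name to the normalizers
registered for it. The sketch below is only an assumption about the shape of
such a lookup; the class, the method names and the scope strings "generate" and
"fetch" are hypothetical and are not taken from Nutch or from plugin.xml.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch of scope-based selection; not the Nutch API.
class ScopedNormalizerSketch {
    // Normalizers grouped by the scope they were registered under.
    private final Map<String, List<UnaryOperator<String>>> byScope =
        new HashMap<>();

    void register(String scope, UnaryOperator<String> normalizer) {
        byScope.computeIfAbsent(scope, k -> new ArrayList<>())
               .add(normalizer);
    }

    // Apply, in registration order, only the normalizers of the given
    // scope. A scope with no normalizers leaves the value unchanged.
    String normalize(String scope, String value) {
        String result = value;
        for (UnaryOperator<String> n
                : byScope.getOrDefault(scope, List.of())) {
            result = n.apply(result);
        }
        return result;
    }
}
```

Registering a host rule under "generate" only would leave
normalize("fetch", "foo.bar.com") untouched, which mirrors the default of no
rules being applied.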