Fix for NUTCH-365 introduced a bug if generate.max.per.host.by.ip is enabled
----------------------------------------------------------------------------
Key: NUTCH-382
URL: http://issues.apache.org/jira/browse/NUTCH-382
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 0.9.0
Reporter: Jim Kellerman
The fix for NUTCH-365 in org.apache.nutch.crawl.Generator.java (revision
449088) introduced a bug: if generate.max.per.host.by.ip is enabled, the
Generator logs the warning "WARN crawl.Generator (Generator.java:reduce(181)) -
Malformed URL: '38.99.15.82', skipping". The IP address in the message varies
with the host.
This is because the hostname has already been converted to its IP address, and
the code:
    host = normalizers.normalize(host,
        URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
will not normalize an IP address. To fix this problem, the code inserted in
revision 449088 should be placed inside an else block so that it is not
executed when generate.max.per.host.by.ip is enabled.
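
The proposed control flow can be sketched roughly as follows. This is a
minimal, standalone illustration and not the actual Generator code: the class
HostKeySketch, the method hostKey, and the two stand-in functions (a fake
resolver and a fake normalizer that rejects bare IP addresses) are all
hypothetical names introduced here so the example runs without Nutch or DNS.

```java
import java.util.function.UnaryOperator;

public class HostKeySketch {

    /**
     * Hypothetical helper: picks the key used for per-host counting.
     * When byIp is true the resolved address is used as-is and the
     * normalizer is never called; otherwise the host name is normalized.
     */
    static String hostKey(String host, boolean byIp,
                          UnaryOperator<String> resolveToIp,
                          UnaryOperator<String> normalizeHost) {
        if (byIp) {
            // Counting by IP: skip normalization entirely.
            return resolveToIp.apply(host);
        } else {
            // Counting by host name: normalization applies only here.
            return normalizeHost.apply(host);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a DNS lookup (assumption: always returns this address).
        UnaryOperator<String> fakeResolver = h -> "38.99.15.82";

        // Stand-in for the normalizer: rejects dotted-quad input, lowercases the rest.
        UnaryOperator<String> fakeNormalizer = h -> {
            if (h.matches("\\d+(\\.\\d+){3}")) {
                throw new IllegalArgumentException("Malformed URL: '" + h + "'");
            }
            return h.toLowerCase();
        };

        System.out.println(hostKey("WWW.Example.COM", true, fakeResolver, fakeNormalizer));
        System.out.println(hostKey("WWW.Example.COM", false, fakeResolver, fakeNormalizer));
    }
}
```

With the normalization in the else branch, the IP address is never handed to
the normalizer, so the "Malformed URL" path is not reached when counting by IP.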