[ http://issues.apache.org/jira/browse/NUTCH-382?page=all ]
Jim Kellerman updated NUTCH-382:
--------------------------------
Attachment: patch.txt
Patch to fix this issue.
> Fix for NUTCH-365 introduced a bug if generate.max.per.host.by.ip is enabled
> ----------------------------------------------------------------------------
>
> Key: NUTCH-382
> URL: http://issues.apache.org/jira/browse/NUTCH-382
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 0.9.0
> Reporter: Jim Kellerman
> Attachments: patch.txt
>
>
> The fix for NUTCH-365 in org.apache.nutch.crawl.Generator.java (revision
> 449088) introduced a bug: if generate.max.per.host.by.ip is enabled, the
> Generator logs the warning "WARN crawl.Generator
> (Generator.java:reduce(181)) - Malformed URL: '38.99.15.82', skipping".
> The IP address in the message varies with the host.
> This happens because the hostname has already been converted to its IP
> address, and the code:
>     host = normalizers.normalize(host,
>         URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
> will not normalize an IP address. The fix is to move the code inserted in
> revision 449088 into an else block, so that it is not executed when
> generate.max.per.host.by.ip is enabled.
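
The else-block arrangement described above could look roughly like the following. This is a minimal, self-contained sketch and not the contents of patch.txt: the class HostKeySketch, the method hostKey, and the parameters resolveToIp and normalizeUrl are hypothetical stand-ins for the IP lookup and the URLNormalizers call in Generator, and the sketch assumes the normalizer expects a full URL string rather than a bare host or IP address.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; none of these names come from Nutch.
class HostKeySketch {

    /**
     * Returns the key used for per-host counting, or null if the URL is
     * malformed (the caller would log a warning and skip the entry).
     *
     * resolveToIp  - assumed stand-in for the hostname-to-IP lookup
     * normalizeUrl - assumed stand-in for the URL normalizer
     */
    static String hostKey(String urlString, boolean byIp,
                          UnaryOperator<String> resolveToIp,
                          UnaryOperator<String> normalizeUrl) {
        try {
            if (byIp) {
                // By-IP mode: count by address and never pass the bare IP
                // to the normalizer, since it is not a parseable URL.
                String host = new URL(urlString).getHost();
                return resolveToIp.apply(host);
            } else {
                // By-name mode: normalize first, then extract the host.
                String normalized = normalizeUrl.apply(urlString);
                return new URL(normalized).getHost().toLowerCase();
            }
        } catch (MalformedURLException e) {
            return null; // corresponds to "Malformed URL ..., skipping"
        }
    }
}
```

With this shape, a bare address such as '38.99.15.82' only reaches the URL parser when it was supplied as the input URL itself, not as the result of the IP conversion.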