host normalization in Generator$Selector
-----------------------------------------
Key: NUTCH-387
URL: http://issues.apache.org/jira/browse/NUTCH-387
Project: Nutch
Issue Type: Bug
Components: generator
Environment: nutch trunk since revision 449088
Reporter: Johannes Zillmann
The host normalization in Generator$Selector#reduce at line 177 seems broken:
    String host = new URL(url.toString()).getHost();
    ...
    try {
      host = normalizers.normalize(host,
                                   URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      host = new URL(host).getHost().toLowerCase();
    } catch (Exception e) {
      LOG.warn("Malformed URL: '" + host + "', skipping");
    }
With the default configuration the basic normalizer is called, which does
'new URL(host)'.
The line below it also calls 'new URL(host)'.
Since url.getHost() always returns the host without a protocol, a
MalformedURLException is always thrown.
The job continues as usual, though, because the exception is caught; the host
is simply left un-normalized and a warning is logged.
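
The failure can be reproduced outside of Nutch with plain java.net.URL. The following is only a minimal sketch: the class name HostNormalizationDemo, the helper parsesAsUrl and the sample URL are made up for illustration and are not part of the Generator code. It shows that a bare host name, as returned by getHost(), cannot be passed back into 'new URL(...)', while the same host with a protocol in front of it can.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
public class HostNormalizationDemo {

    /** Returns true if java.net.URL accepts the given string. */
    public static boolean parsesAsUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            // A string without a protocol ends up here.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Same pattern as in the report: take the host part of a full URL.
        String host = new URL("http://lucene.apache.org/nutch/").getHost();

        // The bare host has no protocol, so it is not a valid URL on its own.
        System.out.println("bare host parses:    " + parsesAsUrl(host));

        // With a protocol in front it parses (illustration only, not a proposed fix).
        System.out.println("with protocol parses: " + parsesAsUrl("http://" + host));
    }
}
```

Prefixing "http://" here only serves to illustrate the cause; whether the Generator should normalize the full URL instead of the host is left open.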