[ http://issues.apache.org/jira/browse/NUTCH-387?page=comments#action_12443742 ]

Otis Gospodnetic commented on NUTCH-387:
----------------------------------------

This indeed looks wrong. My guess is that the new URL(....) line just needs to be removed, but I'm not sure, so I'll let somebody else make the actual change.

> host normalization in Generator$Selector
> ----------------------------------------
>
>                 Key: NUTCH-387
>                 URL: http://issues.apache.org/jira/browse/NUTCH-387
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>         Environment: nutch trunk since revision 449088
>            Reporter: Johannes Zillmann
>
> The host normalization in Generator$Selector#reduce at line 177 seems broken:
>
>     String host = new URL(url.toString()).getHost();
>     ...
>     try {
>       host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>       host = new URL(host).getHost().toLowerCase();
>     } catch (Exception e) {
>       LOG.warn("Malformed URL: '" + host + "', skipping");
>     }
>
> With the default configuration the basic normalizer is called, which does 'new URL(host)'.
> The line below it also calls 'new URL(host)'.
> Since url.getHost() always returns the host without a protocol, a MalformedURLException is always thrown.
> The job continues as usual, though, because the exception is caught.
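
The behaviour described in the report can be checked outside Nutch with plain java.net.URL. The following is only a minimal sketch: the class name HostNormalizationDemo and the helpers parsesAsUrl and lowerCaseHost are hypothetical and not part of Nutch, and lowerCaseHost illustrates one possible reading of "remove the new URL(....) line", not the actual fix.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

public class HostNormalizationDemo {

    /** Returns true if java.net.URL accepts the string, false if it throws. */
    public static boolean parsesAsUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    /**
     * Hypothetical alternative: take the host from the full URL once and
     * lower-case it directly, without feeding the bare host back into new URL().
     * Returns null if the input itself is not a valid URL.
     */
    public static String lowerCaseHost(String url) {
        try {
            return new URL(url).getHost().toLowerCase(Locale.ROOT);
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        String url = "http://WWW.Example.COM/index.html";
        String host = new URL(url).getHost();   // host only, no protocol

        // A bare host has no protocol, so new URL(host) is rejected.
        System.out.println("bare host parses as URL: " + parsesAsUrl(host));
        System.out.println("full URL parses as URL:  " + parsesAsUrl(url));
        System.out.println("lower-cased host:        " + lowerCaseHost(url));
    }
}
```

Note that this sketch does not touch the normalizer call itself. According to the report, the basic normalizer also does 'new URL(host)', so it would need the same attention.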
