host normalization in Generator$Selector 
-----------------------------------------

                 Key: NUTCH-387
                 URL: http://issues.apache.org/jira/browse/NUTCH-387
             Project: Nutch
          Issue Type: Bug
          Components: generator
         Environment: nutch trunk since revision 449088
            Reporter: Johannes Zillmann


The host normalization in Generator$Selector#reduce at line 177 seems broken:
String host = new URL(url.toString()).getHost();
...
try {
    host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
    host = new URL(host).getHost().toLowerCase();
} catch (Exception e) {
    LOG.warn("Malformed URL: '" + host + "', skipping");
}

With the default configuration, the basic normalizer is called, which does
'new URL(host)'.
The line below it also calls 'new URL(host)'.
Since url.getHost() always returns the host without a protocol, a
MalformedURLException is always thrown.
The job continues as usual, though, because the exception is caught.
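
The failure can be reproduced outside Nutch with java.net.URL alone. The
sketch below is only an illustration: the class HostNormalizationDemo and the
helpers parsesAsUrl and normalizeHost are hypothetical names, not part of
Nutch, and wrapping the host in a dummy "http://" URL is one possible
workaround, not necessarily the fix that belongs in the generator.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HostNormalizationDemo {

    /** Returns true if java.net.URL accepts the string, false otherwise. */
    static boolean parsesAsUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            // A bare host name has no protocol, so we end up here.
            return false;
        }
    }

    /**
     * Hypothetical workaround (an assumption, not the Nutch code): put the
     * bare host into a dummy URL so it can be parsed, then extract the host
     * again and lower-case it. Returns null if the host cannot be parsed.
     */
    static String normalizeHost(String host) {
        try {
            URL wrapped = new URL("http://" + host + "/");
            return wrapped.getHost().toLowerCase();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Same first step as in the reported code: extract the bare host.
        String host = new URL("http://www.example.com/index.html").getHost();

        System.out.println("host:            " + host);
        System.out.println("parses as URL:   " + parsesAsUrl(host));
        System.out.println("normalized host: " + normalizeHost(host));
    }
}
```

Under these assumptions, parsesAsUrl(host) returns false for any value that
came from getHost(), which matches the behaviour described above: the catch
block is reached on every call and the host count is never normalized.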

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
