[ https://issues.apache.org/jira/browse/NUTCH-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748869#comment-16748869 ]
ASF GitHub Bot commented on NUTCH-2598:
---------------------------------------

sebastian-nagel commented on pull request #435: NUTCH-2598 URLNormalizerChecker fails on invalid URLs in input
URL: https://github.com/apache/nutch/pull/435

   Output empty string for invalid URLs and do not exit.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> URLNormalizerChecker fails on invalid URLs in input
> ---------------------------------------------------
>
>                 Key: NUTCH-2598
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2598
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.16
>
>
> I use the URLNormalizerChecker (urlnormalizer-regex and urlnormalizer-basic)
> to normalize URLs before further processing them. If one of the used
> normalizers throws a MalformedURLException when the
> URLNormalizer.normalize(...) method is called, this isn't caught and causes
> the checker to exit:
> {noformat}
> Exception in thread "main" java.net.MalformedURLException: For input string: "???120810002"
>         at java.net.URL.<init>(URL.java:627)
>         at java.net.URL.<init>(URL.java:490)
>         at java.net.URL.<init>(URL.java:439)
>         at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:100)
>         at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:319)
>         at org.apache.nutch.net.URLNormalizerChecker.process(URLNormalizerChecker.java:75)
>         at org.apache.nutch.util.AbstractChecker.processStdin(AbstractChecker.java:97)
>         at org.apache.nutch.util.AbstractChecker.run(AbstractChecker.java:77)
>         at org.apache.nutch.net.URLNormalizerChecker.run(URLNormalizerChecker.java:71)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:80)
> Caused by: java.lang.NumberFormatException: For input string: "???120810002"
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>         at java.lang.Integer.parseInt(Integer.java:580)
>         at java.lang.Integer.parseInt(Integer.java:615)
>         at java.net.URLStreamHandler.parseURL(URLStreamHandler.java:222)
>         at java.net.URL.<init>(URL.java:622)
>         ... 10 more
> {noformat}
> The URLNormalizer interface declares the MalformedURLException, it should be
> caught in the normalizer checker:
> - log the error
> - return/output empty string

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
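
The behaviour requested in the issue description (catch the MalformedURLException, log it, output an empty string, keep going) could look roughly like the sketch below. It is not the code from pull request #435. To stay self-contained it does not use any Nutch classes: the `Normalizer` interface, the `PARSING` normalizer, `normalizeOrEmpty`, and the class name `SafeNormalizeSketch` are all hypothetical stand-ins, and the sample input with a non-numeric port was chosen only because java.net.URL is known to reject it.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.logging.Logger;

// Illustrative sketch only; none of these names come from Nutch.
class SafeNormalizeSketch {

  private static final Logger LOG =
      Logger.getLogger(SafeNormalizeSketch.class.getName());

  /** Hypothetical stand-in for a normalizer that declares MalformedURLException. */
  interface Normalizer {
    String normalize(String url) throws MalformedURLException;
  }

  /** Toy normalizer: java.net.URL throws on input such as a non-numeric port. */
  static final Normalizer PARSING = url -> new URL(url).toString();

  /** Returns the normalized URL, or an empty string if the input is invalid. */
  static String normalizeOrEmpty(Normalizer normalizer, String url) {
    try {
      return normalizer.normalize(url);
    } catch (MalformedURLException e) {
      // Log the error and carry on instead of letting the exception end the run.
      LOG.warning("Invalid URL '" + url + "': " + e.getMessage());
      return "";
    }
  }

  public static void main(String[] args) {
    String[] input = {"http://example.com/a", "http://example.com:notaport/"};
    for (String url : input) {
      // Every input line yields exactly one output line, valid or not.
      System.out.println("[" + normalizeOrEmpty(PARSING, url) + "]");
    }
  }
}
```

Returning an empty string rather than skipping the line keeps the output aligned one-to-one with the input, which matters when the checker is fed a list of URLs on stdin.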