Author: ab
Date: Tue Sep 18 12:07:39 2007
New Revision: 577018

URL: http://svn.apache.org/viewvc?rev=577018&view=rev
Log:
NUTCH-554 - Generator throws IOException on invalid urls.

Modified:
    lucene/nutch/trunk/CHANGES.txt
    lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java

Modified: lucene/nutch/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?rev=577018&r1=577017&r2=577018&view=diff
==============================================================================
--- lucene/nutch/trunk/CHANGES.txt (original)
+++ lucene/nutch/trunk/CHANGES.txt Tue Sep 18 12:07:39 2007
@@ -133,6 +133,9 @@
 
 45. NUTCH-546 - file URL are filtered out by the crawler. (dogacan)
 
+46. NUTCH-554 - Generator throws IOException on invalid urls.
+    (Brian Whitman via ab)
+
 Release 0.9 - 2007-04-02
 
  1. Changed log4j confiquration to log to stdout on commandline

Modified: lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java?rev=577018&r1=577017&r2=577018&view=diff
==============================================================================
--- lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java (original)
+++ lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java Tue Sep 18 12:07:39 2007
@@ -184,7 +184,13 @@
         Text url = entry.url;
 
         if (maxPerHost > 0) {                     // are we counting hosts?
-          URL u = new URL(url.toString());
+          URL u = null;
+          try {
+            u = new URL(url.toString());
+          } catch (MalformedURLException e) {
+            LOG.info("Bad protocol in url: " + url.toString());
+            continue;
+          }
           String host = u.getHost();
           if (host == null) {
             // unknown host, skip