[ https://issues.apache.org/jira/browse/NUTCH-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528658 ]
Hudson commented on NUTCH-554:
------------------------------

Integrated in Nutch-Nightly #211 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/211/])

> Generator throws java.io.IOException and dies on injected urls with no protocol
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-554
>                 URL: https://issues.apache.org/jira/browse/NUTCH-554
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 1.0.0
>         Environment: Linux(debian) Java 1.6
>            Reporter: Brian Whitman
>            Assignee: Andrzej Bialecki
>             Fix For: 1.0.0
>
>         Attachments: genpatch.diff
>
>
> On trunk nutch, injecting URLs with no protocol (like issues.apache.org/jira/
> vs. https://issues.apache.org/jira/) causes the generator to fail with an
> IOException:
> java.net.MalformedURLException: no protocol: www.variogr.am
>         at java.net.URL.<init>(URL.java:567)
>         at java.net.URL.<init>(URL.java:464)
>         at java.net.URL.<init>(URL.java:413)
>         at org.apache.nutch.crawl.Generator$Selector.reduce(Generator.java:187)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:326)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:155)
> 2007-09-15 11:11:26,986 FATAL crawl.Generator - Generator: java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:416)
>         at org.apache.nutch.crawl.Generator.run(Generator.java:557)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.crawl.Generator.main(Generator.java:520)
> To test:
> # cat test/urls.txt
> www.variogr.am
> http://www.variogr.am/
> # bin/nutch inject testcrawl/crawldb test/
> (this goes fine)
> # bin/nutch generate testcrawl/crawldb testcrawl/segments -topN 10
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: testcrawl/segments/20070915111125
> Generator: filtering: true
> Generator: topN: 10
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: java.io.IOException: Job failed!
>
> This issue did not exist in earlier versions of nutch -- it would ignore the
> malformed URL without crashing.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
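
For reference, the MalformedURLException in the quoted report can be reproduced outside Nutch, since java.net.URL rejects a string that has no scheme. The following is only a minimal sketch, not the contents of genpatch.diff or of Generator.java; the class and method names (SkipMalformedUrls, hostsOf) are hypothetical. It illustrates one way to skip such entries instead of letting the exception propagate, which is the behaviour the reporter says earlier versions had.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipMalformedUrls {

    // Hypothetical helper (not part of Nutch): returns the host of every
    // entry that java.net.URL can parse and skips the ones it cannot.
    static List<String> hostsOf(List<String> urls) {
        List<String> hosts = new ArrayList<>();
        for (String u : urls) {
            try {
                hosts.add(new URL(u).getHost());
            } catch (MalformedURLException e) {
                // Log and continue rather than failing the whole run.
                System.err.println("Skipping malformed URL: " + u
                        + " (" + e.getMessage() + ")");
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Same two entries as test/urls.txt in the report.
        List<String> urls = Arrays.asList("www.variogr.am", "http://www.variogr.am/");
        System.out.println(hostsOf(urls));
    }
}
```

With the two entries from the report, the first is skipped with a "no protocol" message on stderr and only the host of the second is kept.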