[ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew McCall updated NUTCH-650:
--------------------------------

    Attachment: malformedurl.patch

Ran across a MalformedURLException when running GeneratorHbase.

Stacktrace:

java.net.MalformedURLException: no protocol: http?grp_name=MideastWebDialog&grp_spid=1600667023&grp_cat=://answers.yahoo.com/Regional/Regions/Middle_East/Cultures___Community&grp_user=0
        at java.net.URL.<init>(URL.java:567)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:88)
        at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
        at org.apache.nutchbase.crawl.GeneratorHbase$GeneratorMapReduce.map(GeneratorHbase.java:135)
        at org.apache.nutchbase.crawl.GeneratorHbase$GeneratorMapReduce.map(GeneratorHbase.java:108)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.Child.main(Child.java:155)

The problem lies with URLs in links that have no file and no / between the host and the query string, e.g.

http://answers.yahoo.com?grp_name=MideastWebDialog&grp_spid=1600667023&grp_cat=/Regional/Regions/Middle_East/Cultures___Community&grp_user=0

The first / in the grp_cat field gets interpreted as the beginning of the file.

The attached patch solves the problem by ensuring that a / is added after host:port if it doesn't exist.
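For illustration only, the idea of adding the missing / after host:port could be sketched as below. This is not the code from malformedurl.patch; the class name EnsurePathSlash and the method ensurePathSlash are hypothetical, and the sketch works on the raw string rather than on java.net.URL so that it does not depend on how a particular JDK parses such URLs.

```java
public class EnsurePathSlash {

    /**
     * Hypothetical helper: if the authority (host:port) is followed directly by
     * '?' or '#', or by nothing at all, insert a '/' so the path is never empty.
     * Strings without "://" are returned unchanged.
     */
    static String ensurePathSlash(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url; // no scheme separator, nothing to fix here
        }
        int i = schemeEnd + 3; // first character of the authority
        // The authority ends at the first '/', '?' or '#'.
        while (i < url.length()) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                break;
            }
            i++;
        }
        if (i < url.length() && url.charAt(i) == '/') {
            return url; // a path is already present
        }
        return url.substring(0, i) + "/" + url.substring(i);
    }

    public static void main(String[] args) {
        String broken = "http://answers.yahoo.com?grp_name=MideastWebDialog"
                + "&grp_spid=1600667023"
                + "&grp_cat=/Regional/Regions/Middle_East/Cultures___Community"
                + "&grp_user=0";
        System.out.println(ensurePathSlash(broken));
    }
}
```

With the slash in place, the / inside grp_cat can no longer be mistaken for the start of the file, because the path has already begun before the query string.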
The patch also includes tests and updates build.xml to run the tests in org.apache.nutch*/** instead of just org.apache.nutch/**.

> Hbase Integration
> -----------------
>
>                 Key: NUTCH-650
>                 URL: https://issues.apache.org/jira/browse/NUTCH-650
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.1
>
>         Attachments: hbase-integration_v1.patch, hbase_v2.patch, malformedurl.patch, nofollow-hbase.patch, nutch-habase.patch
>
>
> This issue will track nutch/hbase integration

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.