[ 
https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew McCall updated NUTCH-650:
--------------------------------

    Attachment: malformedurl.patch

Ran across a MalformedURLException when running GeneratorHbase.

Stacktrace: 

java.net.MalformedURLException: no protocol: http?grp_name=MideastWebDialog&grp_spid=1600667023&grp_cat=://answers.yahoo.com/Regional/Regions/Middle_East/Cultures___Community&grp_user=0
        at java.net.URL.<init>(URL.java:567)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer.normalize(BasicURLNormalizer.java:88)
        at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
        at org.apache.nutchbase.crawl.GeneratorHbase$GeneratorMapReduce.map(GeneratorHbase.java:135)
        at org.apache.nutchbase.crawl.GeneratorHbase$GeneratorMapReduce.map(GeneratorHbase.java:108)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.Child.main(Child.java:155)

The problem occurs with link URLs that have no file part and no / between the 
host and the query string.

e.g. 
http://answers.yahoo.com?grp_name=MideastWebDialog&grp_spid=1600667023&grp_cat=/Regional/Regions/Middle_East/Cultures___Community&grp_user=0

The first / in the grp_cat field gets interpreted as the beginning of the file. 

The attached patch solves the problem by adding a / after host:port when one 
is missing. It also includes tests and updates build.xml to run the tests in 
org.apache.nutch*/** instead of just org.apache.nutch/**
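
For readers without the attachment, here is a rough sketch of the idea, not 
the code in malformedurl.patch. The class name SlashAfterAuthority and the 
method ensurePathSlash are made up for illustration, and the sketch works on 
the raw string instead of going through BasicURLNormalizer. It assumes the URL 
is absolute (scheme://host[:port]...).

```java
class SlashAfterAuthority {

    /**
     * Hypothetical helper: inserts "/" right after host[:port] when the URL
     * goes straight from the authority into "?query" or "#fragment", or ends
     * there. URLs that already have a path are returned unchanged.
     */
    static String ensurePathSlash(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd <= 0) {
            return url; // no scheme found; leave untouched
        }
        // Guard against "://" that only appears later, e.g. inside a query.
        for (int i = 0; i < schemeEnd; i++) {
            char c = url.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != '+' && c != '-' && c != '.') {
                return url;
            }
        }
        // The authority ends at the first '/', '?' or '#', or at end of string.
        int authorityEnd = url.length();
        for (int i = schemeEnd + 3; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                authorityEnd = i;
                break;
            }
        }
        if (authorityEnd < url.length() && url.charAt(authorityEnd) == '/') {
            return url; // a path is already present
        }
        return url.substring(0, authorityEnd) + "/" + url.substring(authorityEnd);
    }

    public static void main(String[] args) {
        String broken = "http://answers.yahoo.com?grp_name=MideastWebDialog"
                + "&grp_spid=1600667023"
                + "&grp_cat=/Regional/Regions/Middle_East/Cultures___Community"
                + "&grp_user=0";
        // The "/" now precedes the "?", so the "/" inside grp_cat stays in the query.
        System.out.println(ensurePathSlash(broken));
    }
}
```

With the slash in place, the first / after the host is the real start of the 
file, and the / inside grp_cat stays part of the query string.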




> Hbase Integration
> -----------------
>
>                 Key: NUTCH-650
>                 URL: https://issues.apache.org/jira/browse/NUTCH-650
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.1
>
>         Attachments: hbase-integration_v1.patch, hbase_v2.patch, 
> malformedurl.patch, nofollow-hbase.patch, nutch-habase.patch
>
>
> This issue will track nutch/hbase integration

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
