[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241225#comment-13241225 ]

Markus Jelsma commented on NUTCH-1320:
--------------------------------------

Somewhere down the line IDNs enter the CrawlDB in ASCII form, so there is no 
problem there, but these tools lack the conversion. The filter and normalizer 
checker tools would also benefit. This also suggests the need for an 
IDNNormalizer that does toUnicode when indexing; you don't want http://xn--*/ 
URLs in your index.
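
To illustrate the toUnicode step mentioned above, here is a minimal sketch 
built on the JDK's java.net.IDN class. The class name IdnIndexSketch and the 
method toUnicodeUrl are hypothetical and not part of Nutch; this is only an 
assumption about what such a normalizer could do, not the attached patch.

```java
import java.net.IDN;
import java.net.URI;

// Hypothetical helper for illustration only; not a Nutch class.
final class IdnIndexSketch {

    // Returns the URL with an ASCII-encoded (xn--) host decoded to Unicode.
    // Scheme, user info, port, path, query and fragment are copied verbatim.
    static String toUnicodeUrl(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null || uri.getScheme() == null) {
            return url; // nothing we can safely convert
        }
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        // IDN.toUnicode returns its input unchanged if it cannot be decoded.
        sb.append(IDN.toUnicode(host));
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
```

Only the host is touched; percent-encoded path segments are left as they are, 
since decoding them is a separate decision.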
                
> IndexChecker and ParseChecker choke on IDN's
> --------------------------------------------
>
>                 Key: NUTCH-1320
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1320
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: NUTCH-1320-1.5-1.patch
>
>
> These handy debug tools do not handle IDNs and throw an NPE:
> bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
> {code}
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
> {code}
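
The conversion the checker tools are missing could look roughly like the 
sketch below, which turns a Unicode host into its ASCII form before the URL 
is handed to a tool. IdnInputSketch and toAsciiUrl are hypothetical names, 
not Nutch code, and the sketch assumes java.net.URL is acceptable for 
splitting the input.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration only; not a Nutch class.
final class IdnInputSketch {

    // Returns the URL with its host converted to ASCII (xn--) form.
    // The path, query and fragment are passed through unchanged.
    static String toAsciiUrl(String url) {
        URL u;
        try {
            u = new URL(url);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
        StringBuilder sb = new StringBuilder();
        sb.append(u.getProtocol()).append("://");
        if (u.getUserInfo() != null) {
            sb.append(u.getUserInfo()).append('@');
        }
        // IDN.toASCII throws IllegalArgumentException for hosts it cannot encode.
        sb.append(IDN.toASCII(u.getHost()));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getFile()); // path plus query, if any
        if (u.getRef() != null) {
            sb.append('#').append(u.getRef());
        }
        return sb.toString();
    }
}
```

A tool's main method could call toAsciiUrl on its URL argument before doing 
anything else, so that later lookups see the same ASCII form the CrawlDB uses.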
