Sebastian Nagel created NUTCH-1685:
--------------------------------------

             Summary: URLUtil.toUNICODE fails on IDNs
                 Key: NUTCH-1685
                 URL: https://issues.apache.org/jira/browse/NUTCH-1685
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.2.1, 1.7
         Environment: Java 7, OpenJDK 64-Bit, 1.7.0_25
            Reporter: Sebastian Nagel
             Fix For: 2.3, 1.8


URLUtil.toUNICODE() fails on IDNs and returns null instead of the Unicode URL. 
The URI constructor apparently does not accept IDN host names in Unicode form. 
For {{http://www.xn--evir-zoa.com/}} the URI constructor throws the exception:
{code}
java.net.URISyntaxException: Illegal character in hostname at index 11: 
http://www.çevir.com/
{code}

In principle, IDN.toUnicode() can convert whole URLs (not only domain or host 
names). However, it does not convert URLs whose host name consists of only two 
labels, e.g. {{http://xn--uni-tbingen-xhb.de/}}. Is that the reason why we need 
URLUtil.toUNICODE()?
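
A possible explanation for the two-label case, assuming IDN.toUnicode() works label by label on dot-separated input: for a full URL the first "label" is {{http://xn--uni-tbingen-xhb}}, which does not start with the xn-- prefix and is therefore left alone. The sketch below avoids this by converting only the host part. It is not the URLUtil code; the class name IdnUrlSketch and the helper hostToUnicode are hypothetical, and it assumes the input URL is already in ASCII (Punycode) form so that java.net.URL can parse it.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

public class IdnUrlSketch {

    /**
     * Hypothetical helper (not part of Nutch): converts only the host part
     * of an ASCII/Punycode URL to Unicode and reassembles the URL string.
     */
    static String hostToUnicode(String asciiUrl) throws MalformedURLException {
        URL url = new URL(asciiUrl);
        // IDN.toUnicode() receives the bare host, so every label is seen in isolation
        String unicodeHost = IDN.toUnicode(url.getHost());

        StringBuilder sb = new StringBuilder();
        sb.append(url.getProtocol()).append("://");
        if (url.getUserInfo() != null) {
            sb.append(url.getUserInfo()).append('@');
        }
        sb.append(unicodeHost);
        if (url.getPort() != -1) {
            sb.append(':').append(url.getPort());
        }
        sb.append(url.getFile()); // path plus query, if any
        if (url.getRef() != null) {
            sb.append('#').append(url.getRef());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Three-label and two-label hosts from the report
        System.out.println(hostToUnicode("http://www.xn--evir-zoa.com/"));
        System.out.println(hostToUnicode("http://xn--uni-tbingen-xhb.de/"));
    }
}
```

Building the result as a plain string, rather than passing the Unicode host back into a URI constructor, sidesteps the URISyntaxException quoted above.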



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
