Sebastian Nagel created NUTCH-1685:
--------------------------------------
Summary: URLUtil.toUNICODE fails on IDNs
Key: NUTCH-1685
URL: https://issues.apache.org/jira/browse/NUTCH-1685
Project: Nutch
Issue Type: Bug
Affects Versions: 2.2.1, 1.7
Environment: Java 7, OpenJDK 64-Bit, 1.7.0_25
Reporter: Sebastian Nagel
Fix For: 2.3, 1.8
URLUtil.toUNICODE() fails on IDNs and returns null instead of the Unicode URL.
The constructor of URI apparently does not accept Unicode (IDN) host names. For
{{http://www.xn--evir-zoa.com/}} the URI constructor throws the exception:
{code}
java.net.URISyntaxException: Illegal character in hostname at index 11:
http://www.çevir.com/
{code}
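The failure can be reproduced outside Nutch with a small sketch. It assumes the multi-argument URI constructor is the one involved, which the "Illegal character in hostname" message suggests. The class and method names ({{UriHostCheck}}, {{uriAcceptsHost}}) are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of Nutch.
class UriHostCheck {

    // Returns true if the 7-argument URI constructor accepts the given host.
    // This constructor requires a server-based authority, so the host must
    // parse as an ASCII host name or an IP address.
    static boolean uriAcceptsHost(String host) {
        try {
            new URI("http", null, host, -1, "/", null, null);
            return true;
        } catch (URISyntaxException e) {
            System.out.println(e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // ASCII (Punycode) form of the host
        System.out.println(uriAcceptsHost("www.xn--evir-zoa.com"));
        // Unicode form of the host (\u00e7 is the c with cedilla)
        System.out.println(uriAcceptsHost("www.\u00e7evir.com"));
    }
}
```

The Punycode host should be accepted, while the Unicode host should be rejected with a URISyntaxException.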
In principle, IDN.toUnicode() can convert whole URLs (not only domain or host names).
However, it does not convert URLs whose host consists of only two labels, e.g.
{{http://xn--uni-tbingen-xhb.de/}}. Is that the reason why we need
URLUtil.toUNICODE()?
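A likely explanation, offered as an assumption rather than a confirmed analysis: IDN.toUnicode() splits its input at dots and decodes only labels that start with {{xn--}}. For a two-label host the first "label" is {{http://xn--uni-tbingen-xhb}}, which does not start with the prefix and is therefore left alone. The sketch below decodes only the host and reassembles the URL by hand, avoiding the URI constructor. {{IdnUrlSketch}} and {{toUnicodeUrl}} are hypothetical names, and this is not the actual Nutch fix.

```java
import java.net.IDN;
import java.net.URI;

// Hypothetical helper class, not part of Nutch.
class IdnUrlSketch {

    // Converts only the host of an ASCII URL to Unicode and rebuilds the URL
    // as a plain string, so that no URI is constructed from a Unicode host.
    static String toUnicodeUrl(String url) {
        URI u = URI.create(url); // input is expected to be an ASCII (Punycode) URL
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme()).append("://");
        if (u.getRawUserInfo() != null) {
            sb.append(u.getRawUserInfo()).append('@');
        }
        sb.append(IDN.toUnicode(u.getHost()));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        if (u.getRawPath() != null) {
            sb.append(u.getRawPath());
        }
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String url = "http://xn--uni-tbingen-xhb.de/";
        // Applied to the whole URL: the first label is not recognized as ACE
        System.out.println(IDN.toUnicode(url));
        // Applied to the host only
        System.out.println(toUnicodeUrl(url));
    }
}
```

Passing only the host to IDN.toUnicode() makes the result independent of how many labels the host has.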