Sebastian Nagel created NUTCH-3176:
--------------------------------------
Summary: URLUtil and urlnormalizer-basic: add support for IDNA2008
Key: NUTCH-3176
URL: https://issues.apache.org/jira/browse/NUTCH-3176
Project: Nutch
Issue Type: New Feature
Components: plugin, urlnormalizer, util
Affects Versions: 1.22
Reporter: Sebastian Nagel
Fix For: 1.23
IDNA2008, defined in [RFC 5890|https://www.rfc-editor.org/rfc/rfc5890], has
superseded IDNA2003 ([RFC 3490|https://www.rfc-editor.org/rfc/rfc3490]). Despite
the name, the IDNA2008 RFCs were published in 2010.
When processing URLs and host names, IDNA2008 variants nowadays occur from time
to time, causing issues if they fail to be processed. The corresponding Nutch
components, URLUtil and urlnormalizer-basic, should support IDNA2008.
IDNA2008 allows Unicode characters from versions newer than Unicode 3.2. There
are also some deviations in the mapping between Unicode and ASCII. For example,
{{straße.de}} is mapped to {{strasse.de}} by IDNA2003 (an irreversible
mapping), but to {{xn--strae-oqa.de}} by IDNA2008 (reversible).
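
The difference for {{straße.de}} can be reproduced with a short Python sketch
(Python only for illustration, this is not Nutch code). The built-in {{idna}}
codec implements IDNA2003. The function {{to_ascii_idna2008_sketch}} is a
hypothetical helper that only lowercases and Punycode-encodes non-ASCII labels;
it assumes the input is already a valid IDNA2008 host name and performs none of
the IDNA2008 validity checks.

```python
def to_ascii_idna2003(host: str) -> str:
    """Map a host name to ASCII using Python's built-in IDNA2003 codec."""
    return host.encode("idna").decode("ascii")


def to_ascii_idna2008_sketch(host: str) -> str:
    """Hypothetical helper: Punycode-encode each non-ASCII label as-is.

    Assumption: the input is already a valid IDNA2008 name. No code point,
    Bidi, or contextual rule checks are done; labels are only lowercased.
    """
    labels = []
    for label in host.lower().split("."):
        if label.isascii():
            labels.append(label)
        else:
            # No Nameprep mapping here, so U+00DF is kept rather than
            # being folded to "ss" before encoding.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


host = "stra\u00dfe.de"  # straße.de
print(to_ascii_idna2003(host))         # strasse.de
print(to_ascii_idna2008_sketch(host))  # xn--strae-oqa.de
```

A real implementation would rely on a library that implements IDNA2008 or
UTS #46 in full rather than on this shortcut.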