[
https://issues.apache.org/jira/browse/NUTCH-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080441#comment-18080441
]
ASF GitHub Bot commented on NUTCH-3176:
---------------------------------------
sebastian-nagel opened a new pull request, #914:
URL: https://github.com/apache/nutch/pull/914
- URLUtil:
  - make IDNA2008 the default for the methods toASCII and toUNICODE
  - provide methods to convert host names both for IDNA2003 and IDNA2008
- urlnormalizer-basic:
  - convert host names using IDNA2008 if the property
    urlnormalizer.basic.host.idna2008 is true
- refactor to share methods between URLUtil and urlnormalizer-basic
> URLUtil and urlnormalizer-basic: add support for IDNA2008
> ---------------------------------------------------------
>
> Key: NUTCH-3176
> URL: https://issues.apache.org/jira/browse/NUTCH-3176
> Project: Nutch
> Issue Type: New Feature
> Components: plugin, urlnormalizer, util
> Affects Versions: 1.22
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.23
>
>
> IDNA2008, defined in [RFC 5890|https://www.rfc-editor.org/rfc/rfc5890],
> superseded IDNA2003 ([RFC 3490|https://www.rfc-editor.org/rfc/rfc3490]) in
> 2008 (as the name suggests).
> When processing URLs and host names, IDNA2008 variants nowadays occur from
> time to time, causing issues if they fail to be processed. Corresponding
> Nutch tools, that is URLUtil and urlnormalizer-basic, should support IDNA2008.
> IDNA2008 allows Unicode characters from versions newer than Unicode 3.2. There
> are also some deviations in the mapping between Unicode and ASCII. For
> example {{straße.de}} is mapped to {{strasse.de}} by IDNA2003 (an
> irreversible mapping), but to {{xn--strae-oqa.de}} by IDNA2008 (reversible).
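
The IDNA2003 side of the example above can be reproduced with the JDK alone, since java.net.IDN implements RFC 3490. The following is only a minimal sketch to illustrate the problem, not the code from the pull request: the class Idna2003Demo and its two helper methods are hypothetical names, and no Nutch API is used. IDNA2008 conversion is not covered by java.net.IDN and would need a separate library.

```java
import java.net.IDN;

/** Hypothetical demo class; not part of Nutch or of the pull request. */
final class Idna2003Demo {

  /** Converts a host name to ASCII using the JDK's IDNA2003 implementation. */
  static String toAscii2003(String host) {
    return IDN.toASCII(host);
  }

  /** Converts a host name to Unicode using the JDK's IDNA2003 implementation. */
  static String toUnicode2003(String host) {
    return IDN.toUnicode(host);
  }

  public static void main(String[] args) {
    String unicodeHost = "stra\u00DFe.de"; // straße.de

    // IDNA2003 maps the sharp s to "ss" before encoding.
    String ascii = toAscii2003(unicodeHost);
    System.out.println(ascii); // strasse.de

    // The mapping is irreversible: the sharp s cannot be recovered.
    System.out.println(toUnicode2003(ascii)); // strasse.de

    // An IDNA2008 host name is not decoded to straße.de here, because the
    // IDNA2003 round-trip check in java.net.IDN rejects the label.
    System.out.println(toUnicode2003("xn--strae-oqa.de"));
  }
}
```

The last call shows the kind of processing failure described in the issue: an IDNA2003-only converter cannot turn the IDNA2008 form back into its Unicode host name.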