[ 
https://issues.apache.org/jira/browse/NUTCH-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080441#comment-18080441
 ] 

ASF GitHub Bot commented on NUTCH-3176:
---------------------------------------

sebastian-nagel opened a new pull request, #914:
URL: https://github.com/apache/nutch/pull/914

   - URLUtil:
     - make IDNA2008 the default for the methods toASCII and toUNICODE
     - provide methods to convert host names both for IDNA2003 and IDNA2008
   - urlnormalizer-basic:
     - convert host names using IDNA2008 if the property
       urlnormalizer.basic.host.idna2008 is true
   - refactor to share methods between URLUtil and urlnormalizer-basic
   




> URLUtil and urlnormalizer-basic: add support for IDNA2008
> ---------------------------------------------------------
>
>                 Key: NUTCH-3176
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3176
>             Project: Nutch
>          Issue Type: New Feature
>          Components: plugin, urlnormalizer, util
>    Affects Versions: 1.22
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.23
>
>
> IDNA2008, defined in [RFC 5890|https://www.rfc-editor.org/rfc/rfc5890], 
> superseded IDNA2003 ([RFC 3490|https://www.rfc-editor.org/rfc/rfc3490]) in 
> 2008 (as the name suggests).
> When processing URLs and host names, IDNA2008 variants now occur from time 
> to time and cause issues if they fail to be processed. The corresponding 
> Nutch tools, URLUtil and urlnormalizer-basic, should support IDNA2008.
> IDNA2008 allows Unicode characters from versions newer than Unicode 3.2. 
> There are also some deviations in the mapping between Unicode and ASCII. For 
> example {{straße.de}} is mapped to {{strasse.de}} by IDNA2003 (an 
> irreversible mapping), but to {{xn--strae-oqa.de}} by IDNA2008 (reversible).
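
The IDNA2003 half of the example above can be reproduced with the JDK alone, since java.net.IDN implements RFC 3490. The following is only a minimal sketch: the class and method names (Idna2003Demo, toAscii2003) are made up for illustration and are not part of URLUtil or the pull request. The IDNA2008 result ({{xn--strae-oqa.de}}) cannot be obtained from java.net.IDN and would need a separate library, so it is not shown here.

```java
import java.net.IDN;

public class Idna2003Demo {

    /** Hypothetical helper: converts a host name to ASCII using the JDK's IDNA2003 implementation. */
    static String toAscii2003(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // "straße.de", written with a Unicode escape to avoid source-encoding issues
        String host = "stra\u00dfe.de";

        // IDNA2003 (Nameprep) maps the sharp s to "ss" before encoding
        String ascii = toAscii2003(host);
        System.out.println(ascii); // prints "strasse.de"

        // The mapping is irreversible: converting back does not restore the sharp s
        String back = IDN.toUnicode(ascii);
        System.out.println(back.equals(host)); // prints "false"
    }
}
```

Because the result is plain ASCII without the {{xn--}} prefix, the original spelling of the host name is lost, which is the irreversibility mentioned in the issue description.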



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
