Looks good to me, other than your interpretation of RFC 3490, which leads to the insertion of 0x2E into a DNS label. I guess you and I will simply have to agree to disagree on this point. RFC 3490 should have been clearer.

By the way, I did a Web search for "2024 nfkc" and found that this issue was raised before, but I guess it was not resolved adequately:
http://www.ops.ietf.org/lists/idn/idn.2001/msg02450.html

Erik

On Jan 15, 2008 7:15 AM, Simon Josefsson <[EMAIL PROTECTED]> wrote:
> "Erik van der Poel" <[EMAIL PROTECTED]> writes:
>
> > Yes, that's right.
> >
> > By the way, there may be a different way to address this issue. If
> > libidn has a separate API for NFKC or Nameprep, the caller could pass
> > the entire domain name (including all of the dots and dot-like
> > characters) through NFKC (or Nameprep) first, and then call the normal
> > IDNA routine. This is quite likely to behave the same way as MSIE 7
> > and Firefox 2. If you chose this approach, you could simply document
> > this somewhere, and callers could then decide whether or not to go
> > this way.
>
> Libidn has a simple NFKC interface, and I'm documenting that approach
> now. Below is the current text in the manual. I'll forward this to the
> Firefox IDN guys to see if they are interested in documenting their
> practice further, possibly in an I-D. If ToASCII(NFKC(i)) turns out to
> actually work and behave better than RFC 3490, documenting that now
> seems useful.
>
> Thanks,
> /Simon
>
> Appendix B On Label Separators
> ******************************
>
> Some strings contain characters whose NFKC-normalized form contains the
> ASCII dot (0x2E, "."). Examples of these characters are U+2024 (ONE
> DOT LEADER) and U+248C (DIGIT FIVE FULL STOP). Such strings have the
> interesting property that their IDNA ToASCII output will contain
> embedded dots. For example:
>
>      ToASCII (hi U+248C com) = hi5.com
>      ToASCII (räksmörgås U+2024 com) = xn--rksmrgs.com-l8as9u
>
> These demonstrate the two general cases: the first, where the ASCII dot
> is part of an output that does not begin with the IDN prefix "xn--",
> and the second, where the dot is part of an output that is prefixed
> with "xn--".
>
> The input strings are, from the DNS point of view, a single label.
> The IDNA algorithm translates one label at a time. Thus, the output is
> expected to be only one label. What is important here is to make sure
> the DNS resolver receives the correct query. The DNS protocol does not
> use the dot to delimit labels on the wire; rather, it uses length-value
> pairs. Thus the correct queries would be for `{7}hi5.com' and
> `{22}xn--rksmrgs.com-l8as9u' respectively.
>
> Some implementations (1) have decided that these input strings are
> potentially confusing for the user. The string "hi U+248C com" looks
> like "hi5.com" on systems that support Unicode properly. These
> implementations do not follow RFC 3490. They yield:
>
>      ToASCII (hi U+248C com) = hi5.com
>      ToASCII (räksmörgås U+2024 com) = xn--rksmrgs-5wao1o.com
>
> The DNS queries they perform are `{3}hi5{3}com' and
> `{18}xn--rksmrgs-5wao1o{3}com' respectively. Arguably, this leads to a
> better user experience, and suggests that the IDNA specification is
> sub-optimal in this area.
>
> B.1 Recommended Workaround
> ==========================
>
> It has been suggested to normalize the entire input string using NFKC
> before passing it to IDNA ToASCII. You may use
> `stringprep_utf8_nfkc_normalize' or `stringprep_ucs4_nfkc_normalize'.
> This will avoid the problem, and appears to lead to behaviour similar
> to that of IE/Firefox.
>
> Alternative workarounds are being considered. Eventually Libidn may
> implement a new flag to the `idna_*' functions that implements a
> recommended way to work around this problem.
>
>    ---------- Footnotes ----------
>
>    (1) Notably Microsoft's Internet Explorer and Mozilla's Firefox, but
> not Apple's Safari.
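
For anyone who wants to try the ToASCII(NFKC(i)) approach from section B.1 without building against libidn, here is a minimal sketch in Python. It is not libidn's API: the function name `nfkc_then_toascii` is made up for illustration, and it assumes that Python's built-in "idna" codec (which implements IDNA 2003 per RFC 3490 and splits on dots before converting each label) is an acceptable stand-in for the normal IDNA routine.

```python
import unicodedata


def nfkc_then_toascii(domain: str) -> bytes:
    """Hypothetical helper: NFKC-normalize the whole name, then run ToASCII.

    Normalizing first turns dot-like characters such as U+2024 and U+248C
    into a real ASCII dot, so the per-label conversion that follows sees
    them as label separators rather than as label content.
    """
    normalized = unicodedata.normalize("NFKC", domain)
    # Python's "idna" codec splits on dots and applies ToASCII to each label.
    return normalized.encode("idna")


if __name__ == "__main__":
    # U+248C DIGIT FIVE FULL STOP normalizes to "5."
    print(nfkc_then_toascii("hi\u248ccom"))
    # U+2024 ONE DOT LEADER normalizes to "."
    print(nfkc_then_toascii("r\u00e4ksm\u00f6rg\u00e5s\u2024com"))
```

If this behaves as expected, both names come out as two labels, matching the IE/Firefox results quoted in the appendix rather than the single-label RFC 3490 results.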

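The difference between `{7}hi5.com' and `{3}hi5{3}com' is easier to see when the query name is written out in wire form. The sketch below is only an illustration of the length-value encoding described in the appendix; `to_wire` and `show` are hypothetical names, not part of any resolver library. Note that a real query name also ends with a zero-length root label, which the {n} notation in the manual text leaves out.

```python
from typing import Iterable, List


def to_wire(labels: Iterable[bytes]) -> bytes:
    """Encode labels as DNS length-value pairs, ending with the root label."""
    out = bytearray()
    for label in labels:
        if not 0 < len(label) <= 63:
            raise ValueError("a DNS label must be 1..63 octets long")
        out.append(len(label))  # one length octet ...
        out += label            # ... followed by the label octets, dots included
    out.append(0)               # zero-length root label terminates the name
    return bytes(out)


def show(labels: List[bytes]) -> str:
    """Render labels in the {length}label notation used in the manual text."""
    return "".join("{%d}%s" % (len(label), label.decode("ascii")) for label in labels)


if __name__ == "__main__":
    one_label = [b"hi5.com"]        # strict RFC 3490 reading: dot inside the label
    two_labels = [b"hi5", b"com"]   # IE/Firefox reading: dot separates labels
    print(show(one_label), to_wire(one_label))
    print(show(two_labels), to_wire(two_labels))
```

The two byte strings differ, so a resolver given the single-label form is asking for a different name than one given the two-label form, even though both print as "hi5.com".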