Simon Josefsson wrote:
"Mark Davis" <[EMAIL PROTECTED]> writes:
Implementations that claim conformance to Unicode 3.2 normalization may
not produce identical results in all cases, and may not produce *correct*
normalizations, because versions of UAX #15 prior to 4.1.0 have been
internally inconsistent.

We seem to disagree on this. I believe Unicode 3.2 was consistent. Only the non-normative sections were in conflict with the normative text. I admit an implementation would not meet some of the normalization invariants discussed in the document. But I don't believe the invariants were presented as requirements on the implementation.

I read UAX #15 and PRI #29. It's quite unfortunate that such a mistake was made in the spec, and that several implementations have implemented that mistake so faithfully. Although I would normally feel that the IETF should just stick with the original normalization table and rules (to avoid DNS lookup failures or, heaven forbid, security breaches), in this case, it may be wiser to adopt the new UAX #15 rules, since the invariants are important to IDNA also. The idempotence invariant seems especially important.
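For anyone who wants to see what the invariants mean in practice, here is a minimal sketch in Python, not a definitive test. It assumes the standard `unicodedata` module, whose results depend on the Unicode data and algorithm shipped with the interpreter. The sample strings are my own illustrations of the malformed pattern at issue (a starter, then an intervening combining mark, then a character that would otherwise compose with the starter), and `check_invariants` is a hypothetical helper name.

```python
import unicodedata


def nfc(s: str) -> str:
    """Normalize to NFC using the interpreter's bundled Unicode data."""
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    """Normalize to NFD using the interpreter's bundled Unicode data."""
    return unicodedata.normalize("NFD", s)


def check_invariants(s: str) -> dict:
    """Check two normalization invariants for a single string.

    idempotent:            NFC(NFC(s)) == NFC(s)
    preserves_equivalence: NFD(NFC(s)) == NFD(s), i.e. composing did not
                           change the canonical equivalence class of s
    """
    composed = nfc(s)
    return {
        "idempotent": nfc(composed) == composed,
        "preserves_equivalence": nfd(composed) == nfd(s),
    }


# Illustrative malformed sequences (assumed samples, not registered labels):
# an Oriya vowel sign and a Hangul leading consonant, each followed by
# U+0300 COMBINING GRAVE ACCENT and then a would-be composing character.
SAMPLES = [
    "\u0b47\u0300\u0b3e",
    "\u1100\u0300\u1161",
]

results = {s: check_invariants(s) for s in SAMPLES}

for s, outcome in results.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in s)
    print(code_points, outcome)
```

An implementation that follows the older wording may fail the second check on input like this, which is why the choice of rules matters for IDNA string comparison.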


I feel that we are still at the very beginning of the adoption of the particular Unicode characters affected by this mistake. Most of them are for South Asian languages. Hangul is much further along, but not the particular characters that are affected here (i.e. the Jamo). More importantly, this mistake only affects highly unusual, malformed data. I think that if IDNA decides not to follow Unicode's recommendation now or in the next couple of years, then 10 or 20 years from now we will look back and regret that decision. If there is a time to break compatibility for something, it is now, for this.

The Korean IDN table at IANA does not contain the Jamo that are affected by this mistake. (They use the precomposed syllables, rather than the individual pieces.) I don't know anything about IDN in South Asia, but I doubt that any labels have been registered with this particular type of malformed data.

It is interesting that, in this case, Unicode seems to have implemented first and written the spec later, which is the way the IETF is supposed to do things too. It's just unfortunate that the Unicode spec was transcribed incorrectly from the implementation(s). On the other hand, IDNA seems to have done it in the opposite order. First, the spec was written, and now that we have deployed some implementations, we are finding serious problems with punctuation marks and symbols.

Erik


