Dan asked:

> Kenneth Whistler writes:
> > The *real* problem is guaranteeing interoperability for UTF-8, UTF-16,
> > and UTF-32, which are the three sanctioned encoding forms of Unicode
>
> The obvious choice for Internet protocols is UTF-8. See RFC 2277.
No argument there.

> Systems that use 16-bit encodings internally, such as Windows, handle
> UTF-8 conversions at the boundary between the system and the network.
>
> What's the problem?

What I am referring to is code point range interoperability for the three
encoding forms of Unicode.

The Unicode Standard is tied to ISO/IEC 10646, which is architecturally
a 31-bit character encoding standard, by which I mean that its nominal
code space is 0..0x7FFFFFFF. The Unicode Standard formally limits that
code space, however, to a 21-bit range, namely 0..0x10FFFF. The reason
for that is that UTF-16 can only address that range. In the Unicode
Standard, then, UTF-32 is also constrained to 0..0x10FFFF, and UTF-8 is
constrained to four-byte forms up to <F4 8F BF BF> (i.e. U+10FFFF).

*That* is the guarantee of interoperability, since it means that any
valid value in UTF-8 can be accurately converted to either of the two
other forms, and vice versa.

*If* 10646 were ever to encode a character at a code point beyond
0x10FFFF, *then* there would be an interoperability problem. And that is
why 10646 has recently been amended to retrofit the same constraints on
allowable encoding ranges as those specified in the Unicode Standard.
That is everyone's guarantee that neither SC2/WG2 nor the Unicode
Consortium is going to encode a character that breaks encoding form
interoperability, no matter which of the three forms (or combinations
thereof) you are using for an implementation.

The reason I brought this up at all was to head off the zany garden-path
discussions about "UTF-128" and extending UTF-8 because putatively there
might not be enough code points at some unspecified time in the future.
Let the character encoding committees deal with that issue.
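The shared 0..0x10FFFF limit and the resulting round-trip guarantee can be checked directly. A minimal sketch (mine, not part of the original exchange) using Python's built-in codecs, which implement the constrained definitions of the three encoding forms:

```python
# U+10FFFF is the highest valid Unicode code point; its UTF-8 form is
# the four-byte sequence <F4 8F BF BF> mentioned above.
top = chr(0x10FFFF)
assert top.encode("utf-8") == b"\xf4\x8f\xbf\xbf"

# chr() refuses anything beyond the UTF-16-addressable range...
try:
    chr(0x110000)
except ValueError:
    print("0x110000 is outside the Unicode code space")

# ...and a would-be UTF-8 sequence for a larger value is rejected too
# (an old-style five-byte form from the 31-bit 10646 architecture).
try:
    b"\xf8\x88\x80\x80\x80".decode("utf-8")
except UnicodeDecodeError:
    print("byte sequences beyond U+10FFFF are not valid UTF-8")

# Because all three forms cover exactly the same code point range, any
# valid string round-trips losslessly among them.
s = "A\u00e9\u4e2d\U0010FFFF"  # ASCII, Latin, CJK, and the top code point
for form in ("utf-8", "utf-16-le", "utf-32-le"):
    assert s.encode(form).decode(form) == s
```

The per-character byte counts differ across the three forms, but the set of encodable code points is identical, which is exactly the interoperability guarantee at issue.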
In the meantime, the IETF (and this IDN WG) have the Unicode Standard
and IS 10646, with their standard encoding forms -- just use them, with
the guarantee of interoperability that the relevant committees are
providing, and don't hare off into discussions about their supposed
inadequacy or limitations.

> Converting between UTF-8 and UTF-16 and UTF-32 doesn't cause
> IDNA-style interoperability failures.

Correct. Because the UTC has limited the code space range to ensure that
interoperability.

> Of course, Windows still has all sorts of problems related to its old
> ``code pages,'' and there are many similar problems with old character
> encodings under UNIX. The use of more than one 8-bit ASCII extension
> provides ample opportunity for IDNA-style interoperability failures.
> UTF-8 is a way out of this mess.

Actually, Unicode is a way out of this mess. And which particular
encoding form(s) you then choose depends on the requirements of the
protocol or application you are designing and implementing.

--Ken
