Kenneth Whistler writes:
> The *real* problem is guaranteeing interoperability for UTF-8, UTF-16,
> and UTF-32, which are the three sanctioned encoding forms of Unicode
The obvious choice for Internet protocols is UTF-8. See RFC 2277. Systems that use 16-bit encodings internally, such as Windows, handle UTF-8 conversions at the boundary between the system and the network. What's the problem?

Converting between UTF-8 and UTF-16 and UTF-32 doesn't cause IDNA-style interoperability failures. It's crystal clear which pieces of text are 8-bit and which are 16-bit. Nobody says ridiculous IDNA-type things like ``you should think about converting that to 8 bits if you think it might be displayed, but definitely leave it as 16 bits if you think another program will look at it.'' Each interface makes a clear size choice.

Of course, Windows still has all sorts of problems related to its old ``code pages,'' and there are many similar problems with old character encodings under UNIX. The use of more than one 8-bit ASCII extension provides ample opportunity for IDNA-style interoperability failures. UTF-8 is a way out of this mess.

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago
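(A minimal sketch, not from the original post, illustrating the boundary-conversion point: the three sanctioned encoding forms all encode the same code points, so a UTF-16-internal system can re-encode to UTF-8 at the network boundary and back without loss. The sample string is an arbitrary illustration.)

```python
# The same text round-trips losslessly through all three Unicode
# encoding forms; converting at a system/network boundary is a pure
# re-encoding, never a lossy translation between character sets.
text = "caf\u00e9 \u6f22\u5b57 \U0001d11e"  # includes a non-BMP character

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

# Each form decodes back to the identical sequence of code points.
assert utf8.decode("utf-8") == text
assert utf16.decode("utf-16-le") == text
assert utf32.decode("utf-32-le") == text

# A UTF-16-internal system converting to UTF-8 at the boundary
# produces exactly the bytes a UTF-8-native system would produce.
assert utf16.decode("utf-16-le").encode("utf-8") == utf8
```

This is the contrast with pre-Unicode code pages: two different 8-bit ASCII extensions can map the same byte to different characters, so byte-level conversion between them genuinely loses or corrupts information.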
