Donald Eastlake 3rd <[EMAIL PROTECTED]> wrote on the IETF list:

> There is now a standard way to encode URIs containing arbitrary
> UNICODE characters. This is described in RFC 3275 (which is
> currently a Draft Standard), in Section 4.3.3.1, and in the
> corresponding W3C document and has appeared in other W3C documents,
> for example XML Base.

So U+00E1 LATIN SMALL LETTER A WITH ACUTE (á), which is 0xC3 0xA1 in UTF-8, is encoded as "%C3%A1" (six bytes) according to RFC 3275. All BMP characters above U+07FF, including all CJK characters, take three UTF-8 bytes and thus nine RFC 3275 bytes.

I thought CJK users and others wanted *better* compression.

(No, David, I know you're not all the same person. I heard lots of voices saying the same thing.)

-Doug Ewell
Fullerton, California
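
For anyone who wants to check the byte counts above, here is a rough sketch. It assumes Python's `urllib.parse.quote` as a stand-in for the escaping described in RFC 3275 Section 4.3.3.1 (convert to UTF-8, then write each byte as %HH); the helper name `escaped_length` and the sample CJK character U+4E2D are illustrative only and do not come from the RFC.

```python
from urllib.parse import quote


def escaped_length(ch):
    """Return (UTF-8 byte count, percent-escaped length) for one character."""
    utf8 = ch.encode("utf-8")
    # quote() encodes to UTF-8 by default and escapes each byte as %HH;
    # safe="" means no characters are exempted from escaping.
    escaped = quote(ch, safe="")
    return len(utf8), len(escaped)


# U+00E1 (two UTF-8 bytes) and U+4E2D, a CJK ideograph (three UTF-8 bytes).
for ch in ["\u00e1", "\u4e2d"]:
    print(f"U+{ord(ch):04X}", quote(ch, safe=""), escaped_length(ch))
```

Every UTF-8 byte becomes three ASCII bytes once escaped, so the escaped form is always three times the UTF-8 length for non-ASCII characters.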
