On Tue, 06 Feb 2007 00:25:45 +0900, Chris Kuklewicz <[EMAIL PROTECTED]> wrote:
>> UTF-8 also uses 4 to 6 byte encodings now.
>> CJK Unified Ideographs Extension B, Tai Xuan Jing Symbol and Music Symbol,
>> etc ... use 4 byte encoding.
>
> Looking at several sources, it seems you are incorrect.
>
> Haskell Char go up to Unicode 1114111 (decimal) or 0x10ffff (hexadecimal).
> These are encoded by UTF-8 in 1, 2, 3, or 4 bytes.

I see. I confused Unicode support with charset support. I'm sorry about that.

UCS-4 can represent code points greater than 1114111. So if we wanted to
support the full UCS-4 range, we would have to support the 5- and 6-byte
encodings that RFC 2279 described.

http://www.rfc-editor.org/rfc/rfc2279.txt

But unfortunately UTF-16 can only represent code points up to 1114111
(0x10FFFF), and the Unicode Consortium adheres to the UTF-16 limit. So the
5- and 6-byte encodings, and the 4-byte encodings of code points above
1114111, are invalid now.

http://www.rfc-editor.org/rfc/rfc3629.txt
(RFC3629 says "This memo obsoletes and replaces RFC 2279.")

And Haskell implementations use only the valid range. I forgot about that.

I'm afraid that this assumption will be broken again someday, just as the
belief once held by people in Europe and America, that UCS-2 without
surrogate pairs could cover all languages, was broken.

-- 
shelarcy <shelarcy capella.freemail.ne.jp>
http://page.freett.com/shelarcy/
_______________________________________________
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell
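
To make the difference between the two RFCs concrete, here is a minimal Haskell sketch of how many bytes each scheme would use for a given code point. The function names utf8Length and utf8LengthRfc2279 are made up for this illustration and do not come from any library. The RFC 2279 version is simplified: it only reflects the byte-length ranges of the old 31-bit scheme and ignores the surrogate question.

```haskell
module Utf8Length where

import Data.Char (ord)

-- | Bytes used by UTF-8 as restricted by RFC 3629 (hypothetical helper).
-- Surrogates (0xD800-0xDFFF) and anything above 0x10FFFF are rejected.
utf8Length :: Int -> Maybe Int
utf8Length cp
  | cp < 0                       = Nothing
  | cp <= 0x7F                   = Just 1
  | cp <= 0x7FF                  = Just 2
  | cp >= 0xD800 && cp <= 0xDFFF = Nothing
  | cp <= 0xFFFF                 = Just 3
  | cp <= 0x10FFFF               = Just 4
  | otherwise                    = Nothing

-- | Bytes used under the obsolete RFC 2279 scheme (hypothetical helper),
-- which covered the 31-bit UCS-4 range with up to 6 bytes.
utf8LengthRfc2279 :: Int -> Maybe Int
utf8LengthRfc2279 cp
  | cp < 0           = Nothing
  | cp <= 0x7F       = Just 1
  | cp <= 0x7FF      = Just 2
  | cp <= 0xFFFF     = Just 3
  | cp <= 0x1FFFFF   = Just 4
  | cp <= 0x3FFFFFF  = Just 5
  | cp <= 0x7FFFFFFF = Just 6
  | otherwise        = Nothing

-- | The largest code point a Haskell Char can hold, i.e. 0x10FFFF.
maxCharCode :: Int
maxCharCode = ord (maxBound :: Char)
```

In GHCi, utf8Length 0x20000 (the first code point of CJK Unified Ideographs Extension B) gives Just 4, and utf8Length 0x110000 gives Nothing, while utf8LengthRfc2279 0x110000 gives Just 4 and utf8LengthRfc2279 0x7FFFFFFF gives Just 6.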