On 05/02/07, Chris Kuklewicz <[EMAIL PROTECTED]> wrote:
> shelarcy wrote:
> > Many Haskell UTF-8 libraries don't support encodings over 3 bytes.
>
> UTF-8 uses 1, 2, 3, or 4 bytes. Anything that does not support 4 bytes
> does not support UTF-8.
Well, some of them are probably a bit dated; they likely supported an older version of the standard.
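For a concrete sense of what "supporting 4 bytes" means, here's a quick
sketch of the byte count a standard-conforming encoder needs per
codepoint (the function name is mine, not from any of the libraries
mentioned):

    import Data.Char (ord)

    -- Bytes needed to encode a codepoint in UTF-8, per RFC 3629.
    -- Anything above U+FFFF (the supplementary planes) takes 4 bytes,
    -- so a 3-byte-only codec silently loses those characters.
    utf8Length :: Char -> Int
    utf8Length c
      | n <= 0x7F   = 1
      | n <= 0x7FF  = 2
      | n <= 0xFFFF = 3
      | otherwise   = 4   -- up to U+10FFFF
      where n = ord c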
> > But Takusen's implementation supports it correctly.
>
> Takusen does have unreachable dead code to serialize a Char, as
> (ord c :: Int), up to 31 bits into as many as 6 bytes. And it does
> decode up to 6 bytes to 31 bits and try to "chr" the result from Int
> to Char. Decoding that many bits is not consistent with the UTF-8
> standard: UTF-8 is a 4-byte encoding, and there is no valid UTF-8 5-
> or 6-byte sequence.
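(The dead code Chris describes follows the old pre-RFC 3629 scheme,
under which the sequence grows with the codepoint, up to 6 bytes for a
full 31-bit value. Roughly, as a sketch, with a name of my own
choosing:

    -- Legacy (pre-RFC 3629) UTF-8 sequence lengths, covering 31-bit
    -- ISO/IEC 10646 values. RFC 3629 caps UTF-8 at 4 bytes / U+10FFFF,
    -- so the 5- and 6-byte rows below are no longer valid UTF-8.
    legacyUtf8Length :: Int -> Int
    legacyUtf8Length n
      | n <= 0x7F      = 1
      | n <= 0x7FF     = 2
      | n <= 0xFFFF    = 3
      | n <= 0x1FFFFF  = 4   -- 21 bits; the modern limit
      | n <= 0x3FFFFFF = 5   -- 26 bits; no longer valid
      | otherwise      = 6   -- up to 31 bits; no longer valid
)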
Chris is right here: Takusen's decoder is incorrect w.r.t. the
standard, in that it allows up to 6 bytes to encode a single char. If
it were correct, it would reject 5- and 6-byte sequences.

I copied the extended conversion from HXT's code, which was the most
correct UTF-8 library I had seen so far (it just didn't marshal
directly from a CString, which was what I was after). It turns out
darcs has the most accurate UTF-8 encoders and decoders:

    http://abridgegame.org/cgi-bin/darcs.cgi/darcs/UTF8.lhs?c=annotate

There's nothing stopping the Unicode consortium from expanding the
range of codepoints, is there? Or have they said that'll never happen?

Alistair
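P.S. For what it's worth, a conforming decoder can reject the over-long
forms just by classifying the lead byte. A minimal sketch of that check
(my names, not darcs's or Takusen's actual code):

    import Data.Word (Word8)

    -- Sequence length implied by a UTF-8 lead byte, per RFC 3629.
    -- The old 5- and 6-byte lead bytes (0xF8-0xFD) are rejected, as
    -- are 0xFE, 0xFF, and bare continuation bytes (0x80-0xBF).
    leadByteLength :: Word8 -> Maybe Int
    leadByteLength b
      | b < 0x80  = Just 1    -- ASCII
      | b < 0xC0  = Nothing   -- continuation byte out of place
      | b < 0xE0  = Just 2
      | b < 0xF0  = Just 3
      | b < 0xF8  = Just 4
      | otherwise = Nothing   -- legacy 5/6-byte forms, 0xFE, 0xFF

(A full decoder also has to check continuation bytes, overlong
encodings, and the U+10FFFF and surrogate limits, but the table above
is the part that rules out 5- and 6-byte sequences.)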