shelarcy wrote:
> Hello Twan,
>
> On Mon, 05 Feb 2007 08:46:35 +0900, Twan van Laarhoven <[EMAIL PROTECTED]>
> wrote:
>> I would like to announce my attempt at making a Unicode version of
>> Data.ByteString. The library is named Data.CompactString to avoid
>> conflict with other (Fast)PackedString libraries.
>
> How about adding an abstraction layer?
>
> Spencer Janssen tried to provide an abstraction layer for a Unicode
> ByteString in last year's Summer of Code project.
> It has no Unicode support, but it supplied a good layer, the Stringable class.
>
> http://code.google.com/soc/haskell/appinfo.html?csaid=B934AEBE95120AB2
> http://darcs.haskell.org/SoC/fps-soc/
> http://darcs.haskell.org/SoC/fps-soc-aug21/
>
>
>> The library uses a variable length encoding (1 to 3 bytes) of Chars into
>> Word8s, which are then stored in a ByteString. The structure is very
>> much based on Data.ByteString, most of the implementation is copied from
>> there. Hopefully this means that fusion rules could be copied as well.
>
> UTF-8 also uses 4 to 6 byte encodings now.
> CJK Unified Ideographs Extension B, Tai Xuan Jing Symbol and Music Symbol,
> etc ... use 4 byte encoding.

Looking at several sources, it seems you are incorrect. Haskell Char values
go up to Unicode code point 1114111 (decimal), or 0x10FFFF (hexadecimal).
UTF-8 encodes these in 1, 2, 3, or 4 bytes.

CJK Unified Ideographs Extension B starts at 131072, or 0x20000.
Tai Xuan Jing Symbols start at 119552, or 0x1D300.

These are all within the official UTF-8 encoding scheme.

>
> Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes.

UTF-8 uses 1, 2, 3, or 4 bytes. Anything that does not support 4 bytes does
not support UTF-8.

> But Takusen's implementation supports it correctly.

Takusen does have unreachable dead code that serializes a Char, as
(ord c :: Int), of up to 31 bits into as many as 6 bytes. But it does decode
up to 6 bytes into 31 bits and then tries to "chr" the result from Int to
Char. Decoding that many bits is not consistent with the UTF-8 standard.

>
> http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs
> http://www.haskell.org/pipermail/libraries/2007-February/006841.html
>
> How about supporting 4 to 6 byte encodings?

UTF-8 uses at most 4 bytes per code point. There is no valid 5 or 6 byte
UTF-8 encoding.

>
>
> Best Regards,
>

_______________________________________________
Haskell mailing list
Haskell@haskell.org
http://www.haskell.org/mailman/listinfo/haskell
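
To make the byte counts above concrete, here is a minimal sketch of a UTF-8
encoder for a single Haskell Char. It is an illustration only, not code from
Data.CompactString or Takusen; the names utf8Length and encodeChar are
hypothetical. It assumes the input is a valid Unicode scalar value and does
not reject surrogate code points (0xD800 to 0xDFFF), which a strict encoder
would have to refuse.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Number of bytes UTF-8 uses for one code point: always 1 to 4.
-- (Hypothetical helper for illustration.)
utf8Length :: Char -> Int
utf8Length c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4   -- covers everything up to 0x10FFFF, the largest Char
  where
    n = ord c

-- | Encode one Char as UTF-8 bytes. Surrogates are NOT rejected here;
-- that check is left out to keep the sketch short.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. lead 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. lead 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload bits of the leading byte (the bits above position s).
    lead s = fromIntegral (n `shiftR` s)
    -- Continuation byte: the marker 10xxxxxx plus six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

main :: IO ()
main = do
  -- Start of CJK Unified Ideographs Extension B
  print (utf8Length '\x20000', encodeChar '\x20000')
  -- Start of Tai Xuan Jing Symbols
  print (utf8Length '\x1D300', encodeChar '\x1D300')
  -- Largest Haskell Char
  print (utf8Length maxBound, encodeChar maxBound)
```

All three examples need exactly 4 bytes. Since maxBound :: Char is 0x10FFFF,
the fourth guard is the last case a Char can ever reach, so no 5 or 6 byte
form is needed.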