At 2001-10-09 03:37, Kent Karlsson wrote:

>> > code position (=code point): a value between 0000 and 10FFFF.
>>
>> Would this be a reasonable basis for Haskell's 'Char' type?
>
>Yes. It's essentially UTF-32, but without the fixation to 32-bit
>(21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited
>to 10FFFF instead of 31(!) bits) is the datatype used in some
>implementations of C for wchar_t. As I said in another e-mail,
>if one does not have high efficiency concerns, UTF-32 is a rather
>straightforward way of representing characters.

Would it be worthwhile restricting Char to the 0-10FFFF range, just as
a Word8 is restricted to 0-FF even though in GHC at least it's stored
in 32 bits?

...
>> data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
>> getGeneralCategory :: Char -> Maybe GeneralCategory;
>
>There is not really any "Maybe" just there. Yet unallocated code
>positions have general category Cn (so do non-characters):
>   Cs  Other, Surrogate
>   Co  Other, Private Use
>   Cn  Other, Not Assigned (yet)

OK. It occurred to me to put 'unassigned' as Nothing, since it might
change -- so in a sense getGeneralCategory doesn't know what the GC is.
I assume once a codepoint has a non-Cn GC, it cannot be changed. But
confusingly, some of the GCs are 'normative', whereas others are merely
'informative' -- perhaps these last are subject to revision.

-- 
Ashley Yakeley, Seattle WA
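
For illustration only, here is a minimal sketch of the two ideas discussed above: a Char limited to 0-10FFFF, and a category lookup with no Maybe in its type. It leans on Data.Char from present-day GHC's base library, where generalCategory is total and reports unassigned code points and non-characters as NotAssigned (Cn). The names maxCodePoint, toCodePoint and categoryOf are hypothetical helpers invented for this sketch; they are not part of any library and not the API proposed in this thread.

```haskell
import Data.Char (GeneralCategory (..), chr, generalCategory)

-- | Highest code position, U+10FFFF (hypothetical name for this sketch).
maxCodePoint :: Int
maxCodePoint = 0x10FFFF

-- | Hypothetical helper: accept a number as a Char only if it lies in
-- the range 0..10FFFF; anything outside the range gives Nothing.
toCodePoint :: Int -> Maybe Char
toCodePoint n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing

-- | Total lookup, no Maybe: unassigned code positions and
-- non-characters simply come back as NotAssigned (Cn).
categoryOf :: Char -> GeneralCategory
categoryOf = generalCategory

-- categoryOf 'A'      == UppercaseLetter   (Lu)
-- categoryOf '\xD800' == Surrogate         (Cs)
-- categoryOf '\xE000' == PrivateUse        (Co)
-- categoryOf '\xFFFF' == NotAssigned       (Cn, a non-character)
```

Here the Maybe sits on the conversion from a number rather than on the category lookup, which seems to match Kent's point that Cn is itself a category value and not an absent one.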
