First, let me say that I was pleasantly surprised to hear that my assumption, that the BitC specification must specify the string representation, was incorrect. Specifying fast comparisons does become a problem, but I think BitC could use some extra design space with strings. On a positive note, one can always specify the representation at a later date. For example, perhaps we are overestimating the overhead of not using the platform's string representation, and BitC should be specified to use UTF-8 for compactness, maximally fast comparisons and fast I/O. In fact, this seems a likely outcome. But for now, the fast comparison function should probably compare code points. That is quite fast with all of the encodings, although far from optimal.
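To make the point about comparing code points concrete, here is a minimal Python sketch (not BitC; the helper names `code_points` and `utf16_units` are made up for illustration). It shows that a byte-wise comparison of UTF-8 agrees with code point order, while a naive unit-wise comparison of UTF-16 does not once surrogate pairs are involved.

```python
def code_points(s: str) -> list[int]:
    """Return the Unicode code points of s (a Python str is a code point sequence)."""
    return [ord(c) for c in s]


def utf16_units(s: str) -> list[int]:
    """Return the 16-bit UTF-16 code units of s (hypothetical helper)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


a = "\uff5e"      # U+FF5E, a BMP character above the surrogate range
b = "\U00010000"  # U+10000, encoded in UTF-16 as the pair D800 DC00

# Code point order: U+FF5E sorts before U+10000.
by_code_point = code_points(a) < code_points(b)

# UTF-8 byte order matches code point order (EF BD 9E vs F0 90 80 80).
by_utf8_bytes = a.encode("utf-8") < b.encode("utf-8")

# Naive UTF-16 unit order does not: 0xFF5E is greater than 0xD800.
by_utf16_units = utf16_units(a) < utf16_units(b)

print(by_code_point, by_utf8_bytes, by_utf16_units)
```

So a comparison specified in terms of code points can be a plain memcmp-style loop over UTF-8, but needs a small fix-up for surrogates when the storage is UTF-16.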
2010/3/11 Jonathan S. Shapiro <[email protected]>:
> System.Char is typed in BitC as "BitC.UCS2".
> System.String is typed in BitC as "BitC.UCS2 Vector".
> BitC.Char, if present, is a type alias for BitC.UCS4, a.k.a Unicode Code
> Points.
>
> Is that it?
>
> I think that this is one consistent position. The other consistent position
> would be that "BitC.char" is a type alias for BitC.UCS2.

That's what I'd vote for, preferably without any confusing aliases. I don't fancy the UCS* names, though: consider conversion from a vector of BitC.UCS2 to a BitC.String. Should the vector be interpreted as UCS-2 (no surrogate pairs) or UTF-16 (possible surrogate pairs)? Come to think of it, naming code units "2-byte universal character sets" is a bit strange.

I don't think that the above is a serious problem, though. At least UCS2 points in the right direction: toward the Unicode standard instead of intuition (although UCS-2 is actually a deprecated standard). Still, I consider CodeUnit16 a better choice.

> So I think we are down to: which way should the BitC.char type alias go?

Nowhere is the correct answer, or, from another viewpoint, a pedant's answer. The confusion among programmers causes a lot of bugs today, and I believe that even linguists debate the question "what is a character". Whoever writes a comparison function must know Unicode anyway, and if they don't, this is how BitC could instruct them: simply by using an accurate name.

To summarize:

    newtype CodeUnit16 = UInt16
    newtype CodePoint = UInt32

    System.Char maps to CodeUnit16
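To illustrate the UCS-2 versus UTF-16 ambiguity and the two proposed names, here is a small Python sketch. It is not BitC and not a proposal for the actual conversion API: `as_ucs2` and `as_utf16` are hypothetical helpers, and passing lone surrogates through unchanged is an assumption made only to keep the example short.

```python
from typing import List, NewType

# Distinct names for distinct concepts, mirroring the summary above.
CodeUnit16 = NewType("CodeUnit16", int)  # one 16-bit storage unit
CodePoint = NewType("CodePoint", int)    # one Unicode code point


def as_ucs2(units: List[CodeUnit16]) -> List[CodePoint]:
    """Interpret every 16-bit unit as a code point of its own (no surrogate pairs)."""
    return [CodePoint(u) for u in units]


def as_utf16(units: List[CodeUnit16]) -> List[CodePoint]:
    """Interpret the units as UTF-16, combining surrogate pairs into one code point."""
    out: List[CodePoint] = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            out.append(CodePoint(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        else:
            # Assumption: lone surrogates are passed through unchanged.
            out.append(CodePoint(u))
            i += 1
    return out


# 'A' followed by U+1F600, which UTF-16 stores as the pair D83D DE00.
units = [CodeUnit16(0x0041), CodeUnit16(0xD83D), CodeUnit16(0xDE00)]

ucs2_view = as_ucs2(units)    # three "characters", two of them surrogates
utf16_view = as_utf16(units)  # two code points: U+0041 and U+1F600
```

The same three code units yield three values under the UCS-2 reading and two under the UTF-16 reading, which is why a name like CodeUnit16 says less, and promises less, than UCS2.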
