First, let me say that I was pleasantly surprised to hear that my
assumption - that the BitC specification must specify the string
representation - was incorrect. Specifying fast comparisons does
become a problem, but I think BitC could use some extra design space
with strings. On a positive note, one can always specify the
representation at a later date. For example, perhaps we are
overestimating the overhead of not using the platform's string
representation, and BitC should be specified to use UTF-8 for
compactness, maximally fast comparisons and fast I/O. In fact, this
seems a likely outcome. But for now, the fast comparison function
should probably compare code points. That is reasonably fast with all
of the encodings, although far from optimal.
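
To make the ordering point concrete, here is a minimal sketch in Python
(not BitC). The helper names are made up for illustration. It compares
two strings by code point and then checks how the raw UTF-8 bytes and
the raw UTF-16 code units of the same strings would order. The two
sample characters are chosen so that one lies high in the BMP and the
other outside it.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def compare_code_points(a, b):
    """Three-way comparison by code point: -1, 0 or 1."""
    pa = [ord(c) for c in a]
    pb = [ord(c) for c in b]
    return (pa > pb) - (pa < pb)


a = "\uFF5E"       # U+FF5E, a BMP character
b = "\U0001F600"   # U+1F600, needs a surrogate pair in UTF-16

by_code_point = compare_code_points(a, b)
utf8_agrees = (a.encode("utf-8") < b.encode("utf-8"))
utf16_agrees = (utf16_units(a) < utf16_units(b))
```

For this pair the byte-wise UTF-8 order agrees with the code point
order, while the unit-wise UTF-16 order does not, because surrogates
(0xD800 to 0xDFFF) sort below the top of the BMP.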

2010/3/11 Jonathan S. Shapiro <[email protected]>:
> System.Char is typed in BitC as "BitC.UCS2".
> System.String is typed in BitC as "BitC.UCS2 Vector".
> BitC.Char, if present, is a type alias for BitC.UCS4, a.k.a Unicode Code
> Points.
>
> Is that it?
>
> I think that this is one consistent position. The other consistent position
> would be that "BitC.char" is a type alias for BitC.UCS2.

That's what I'd vote for, preferably without any confusing aliases.

I don't fancy the UCS* names, though: consider conversion from a
vector of BitC.UCS2 to a BitC.String. Should the vector be interpreted
as UCS-2 (no surrogate pairs) or UTF-16 (possible surrogate pairs)?
Come to think of it, naming code units "2-byte universal character
sets" is a bit strange.
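
A small Python sketch (not BitC) of the ambiguity, with hypothetical
function names: the same vector of 16-bit values yields different code
points depending on whether it is read as UCS-2 or as UTF-16. Passing
unpaired surrogates through unchanged is my assumption here; a real
conversion might reject them instead.

```python
def decode_as_ucs2(units):
    """UCS-2 reading: every 16-bit unit is taken as one code point."""
    return list(units)


def decode_as_utf16(units):
    """UTF-16 reading: a high/low surrogate pair forms one code point."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code point, or an unpaired surrogate passed through as-is.
            out.append(u)
            i += 1
    return out


units = [0x0041, 0xD83D, 0xDE00]
ucs2_view = decode_as_ucs2(units)
utf16_view = decode_as_utf16(units)
```

The UCS-2 reading keeps three values, two of which are surrogates and
not characters at all; the UTF-16 reading gives two code points,
U+0041 and U+1F600.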

I don't think that the above is a serious problem, though. At least
UCS2 points in the right direction: toward the Unicode standard rather
than intuition (although UCS-2 is actually a deprecated standard).
Still, I consider CodeUnit16 a better choice.

> So I think we are down to: which way should the BitC.char type alias go?

Nowhere is the correct answer, or from another viewpoint, the answer
of a pedant. The confusion among programmers causes a lot of bugs
today, and I believe that even linguists debate the question "what
is a character". The guy who writes a comparison function must know
his Unicode anyway, and if he doesn't, this is the way BitC could
instruct him: simply by using an accurate name.

To summarize:

newtype CodeUnit16 = UInt16
newtype CodePoint = UInt32
System.Char maps to CodeUnit16
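
As a rough illustration of why two distinct types help, here is a
Python sketch (not BitC syntax). The class layout and the range checks
are my assumptions; the summary above only says that the two types wrap
UInt16 and UInt32. The upper bound 0x10FFFF is the highest Unicode code
point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeUnit16:
    """One 16-bit code unit; may be half of a surrogate pair."""
    value: int

    def __post_init__(self):
        if not 0 <= self.value <= 0xFFFF:
            raise ValueError("code unit out of 16-bit range")


@dataclass(frozen=True)
class CodePoint:
    """One Unicode code point, stored in what would be a UInt32."""
    value: int

    def __post_init__(self):
        if not 0 <= self.value <= 0x10FFFF:
            raise ValueError("code point above U+10FFFF")


unit = CodeUnit16(0x41)
point = CodePoint(0x41)
same = (unit == point)  # different types, so not equal despite equal values
```

The point is only that a code unit and a code point holding the same
number are still different things, and the names say which is which.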
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
