[Re-send - original sent to wrong alias] One of the mundane issues I want to take up is character and string encoding. The issue that is driving this is JVM/CLR, neither of which properly implements Unicode. That is: the "character" type in both runtimes is 16 bits, and a single value of that type can only encode a code point in the Basic Multilingual Plane (BMP).
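
To make the 16-bit limitation concrete, here is a small sketch of how a code point maps onto 16-bit code units. It is in Python purely for illustration, and the helper name to_utf16_units is made up for this note; it is not part of BitC or of either runtime. Anything above U+FFFF comes out as two units, so it cannot fit in a single 16-bit character.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        # This range is reserved for surrogates; it never names a character.
        raise ValueError("surrogate values are not characters")
    if code_point <= 0xFFFF:
        # Inside the BMP: one 16-bit unit is enough.
        return [code_point]
    # Outside the BMP: split the 20-bit offset into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]
```

For example, to_utf16_units(0x41) gives [0x41], while to_utf16_units(0x1D11E) gives [0xD834, 0xDD1E].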
As of BitC 0.10, the position was:

  - External (on-file) encoding for source units of compilation is UTF-8.
  - Strings are immutable.
  - Characters are 32 bits, UCS-4 encoded.

Actually, there is a bug in the 0.10 specification: string literals are discussed, but strings are not specified as a core type.

From a principled standpoint, I still think that the decisions above were the right ones, but both JVM and CLR are restricted to a 16-bit native character type.

For strings, there isn't really any problem worse than inconvenience. In CLR (and I believe in JVM), characters outside the BMP can be encoded in strings using surrogate pairs, and with (considerable) care these can be processed. So long as we reject strings that contain malformed surrogate pairs, we should be fine. In practice this means:

  - Strings returned from CLR/JVM routines need to be validated.
  - Substring operations must fail if they would result in a broken surrogate pair. The easiest way to handle this is to define them in terms of characters rather than code units.

So the crux of the matter is the size of the "char" type in BitC. It is clear that for JVM/CLR interaction we will need a 16-bit character-like type. It seems equally clear that for any sane Unicode-based processing we need a 32-bit character-like type.

If we choose BitC "char" to be 16 bits, then I propose to add a new character type UniChar that covers the full Unicode set. Alternatively, we could define "char" as the 32-bit unit, and introduce BMPChar or CodeUnit for interaction with JVM/CLR.

Is there an obviously preferable choice? If not, then given that we want to be able to target these platforms, what do you think we should do about all this?

Possibly relevant background:

  http://perldoc.perl.org/Encode/Unicode.html#Surrogate-Pairs

The four possible positions you can take once you start down the UCS-2 path:

  http://cad.kiev.ua/~demch/multiling/unicode/utf16.html

Jonathan
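
P.S. A rough sketch of the two string rules above (validate what comes back from CLR/JVM, and index substrings by character). It is in Python only for illustration and models a runtime string as a plain list of 16-bit code units. The names decode_units and substring are hypothetical, not proposed BitC library names, and a real implementation would presumably avoid decoding the whole string on every substring call.

```python
def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF


def decode_units(units):
    """Validate 16-bit code units and return the code points they encode.

    Raises ValueError on any malformed surrogate pair.
    """
    chars = []
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u):
            # A high surrogate must be followed immediately by a low one.
            if i + 1 >= len(units) or not is_low(units[i + 1]):
                raise ValueError("high surrogate at index %d has no low surrogate" % i)
            chars.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif is_low(u):
            raise ValueError("unpaired low surrogate at index %d" % i)
        else:
            chars.append(u)
            i += 1
    return chars


def substring(units, start, end):
    """Substring indexed by characters rather than code units.

    Because the indices count whole characters, a surrogate pair can
    never be split; the result is a list of full (32-bit) code points.
    """
    return decode_units(units)[start:end]
```

With units = [0x41, 0xD834, 0xDD1E, 0x42], substring(units, 1, 2) yields [0x1D11E], whereas slicing the raw code units at the same indices would have returned a lone high surrogate.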
