While I want BitC to run on the CLI infrastructure, I'm *not* committed to using the string library provided by CLI for anything other than interop purposes. Among other issues, BitC doesn't (currently) do objects.
Strings really aren't the problem. JVM/CLR strings will work fine so long as we ensure, by construction or by validation, that every string consists of a valid sequence of legal Unicode code points encoded as UTF-16. That is: the main constraint here is on the mechanisms that *build* strings, not on the string object itself.

The problem is the representable range of the character type. The situation in both CLR and JVM is more than a bit weird. CLR/JVM strings are encoded as UTF-16, but from a typing perspective they are really a vector of UCS-2 code units. This is true because UTF-16 is an encoding rather than a representation. The expression s[i] returns a UCS-2 unit that may or may not be a valid Unicode character. So contrary to what Kevin wrote, characters are *not* UTF-16 in JVM/CLR. They are UCS-2, and they aren't actually characters. They are "code units".

I looked briefly at the various Unicode roadmaps (http://unicode.org/roadmaps/). Pragmatically, the two parts that concern me are the re-encoding of CJK in the SIP (plane 2) and the so-called "Large Asian Scripts" region of the SMP (plane 1). So maybe the answer here is that CLR/JVM are right, the upper planes are never going to be used, and it's okay for "char" to cover the BMP only. If we need to do better later, we can either follow what CLR/JVM are doing and use strings, or we can introduce a new type CodePoint that covers the whole span. The practical impact of this in the near term should be pretty minimal - mainly that we won't be able to accept upper-plane input in string or character literals.

Do people think that is a sensible position?

Jonathan
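
P.S. In case it helps make the code unit point concrete, here is a minimal Java sketch (the JVM side; the CLR story is analogous). The choice of U+20000, a SIP ideograph, is just an example:

    public class CodeUnitDemo {
        public static void main(String[] args) {
            // U+20000 lies in the SIP (plane 2); in UTF-16 it encodes as
            // the surrogate pair D840 DC00, i.e. two code units.
            String s = new String(Character.toChars(0x20000));

            System.out.println(s.length());                       // 2 -- code units, not characters
            System.out.println(Integer.toHexString(s.charAt(0))); // d840 -- a lone surrogate,
                                                                  //   not a valid character by itself
            System.out.println(Integer.toHexString(s.codePointAt(0))); // 20000 -- the real code point
            System.out.println(s.codePointCount(0, s.length()));  // 1
        }
    }

Note that s.charAt(0) type-checks perfectly well even though what it returns is not a character in any useful sense - exactly the vector-of-UCS-2 behavior described above.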
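
P.P.S. And since I claimed the constraint falls on the mechanisms that *build* strings: a sketch of what validate-at-construction might look like, again in Java. The helper name isWellFormedUtf16 is mine, not an API from either platform - it just scans for unpaired or out-of-order surrogates:

    public class Utf16Check {
        // Hypothetical helper: true iff every code unit in s participates
        // in a well-formed UTF-16 sequence (no unpaired surrogates).
        static boolean isWellFormedUtf16(CharSequence s) {
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (Character.isHighSurrogate(c)) {
                    // A high surrogate must be immediately followed by a low one.
                    if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1)))
                        return false;
                    i++; // consume the paired low surrogate
                } else if (Character.isLowSurrogate(c)) {
                    return false; // low surrogate with no preceding high surrogate
                }
            }
            return true;
        }
    }

A BitC string constructor that enforced this invariant - or that built strings only from whole code points - would let everything downstream assume valid UTF-16, which is all the interop case needs.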
