While I want BitC to run on the CLI infrastructure, I'm *not* committed to
using the string library provided by CLI for anything other than interop
purposes. Among other issues, BitC doesn't (currently) do objects.

Strings really aren't the problem. JVM/CLR strings will work fine so long
as we ensure, by construction or validation, that every string consists of
a valid sequence of legal Unicode code points encoded as UTF-16. That is:
the main constraint here is on the mechanisms that *build* strings, not on
the string object itself.
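
As a concrete sketch of that build-time validation, here is a small Java
check (Java standing in for JVM/CLR code; the class and method names are
illustrative, not part of any BitC design) that verifies a string is
well-formed UTF-16, i.e. that every surrogate code unit is properly paired:

```java
public class Utf16Check {
    // Returns true iff s is well-formed UTF-16: every high surrogate is
    // immediately followed by a low surrogate, and no low surrogate
    // appears without a preceding high surrogate.
    static boolean isWellFormedUtf16(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length()
                        || !Character.isLowSurrogate(s.charAt(i + 1)))
                    return false;   // unpaired high surrogate
                i++;                // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;       // low surrogate with no high half
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("hello"));        // true
        System.out.println(isWellFormedUtf16("\uD835\uDD4F")); // true: a valid pair
        System.out.println(isWellFormedUtf16("\uD835"));       // false: lone high surrogate
    }
}
```

Running a check like this at the string-construction boundary is exactly
the kind of "constraint on the mechanisms that build strings" meant above.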

The problem is the representable range of the character type. The situation
in both CLR and JVM is more than a bit weird. CLR/JVM strings are encoded
as UTF-16, but from a typing perspective they are really a vector of UCS-2
code units. This is true because UTF-16 is an encoding rather than a
representation. The expression s[i] returns a UCS-2 unit that may or may
not be a valid Unicode character. So contrary to what Kevin wrote,
characters are *not* UTF-16 in JVM/CLR. They are UCS-2, and they aren't
actually characters. They are "code units".
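
Java makes the distinction easy to see: charAt hands back a 16-bit code
unit that may be only half of a surrogate pair, while codePointAt decodes
the actual code point. A quick illustration:

```java
public class CodeUnits {
    public static void main(String[] args) {
        // U+1D54F MATHEMATICAL DOUBLE-STRUCK CAPITAL X, outside the BMP
        String s = "\uD835\uDD4F";
        System.out.println(s.length());          // 2 code units, one character
        char u = s.charAt(0);                    // 0xD835: a lone high surrogate,
                                                 // not a valid Unicode character
        System.out.println(Character.isSurrogate(u));        // true
        System.out.println(s.codePointAt(0));    // 120143 (0x1D54F): the real code point
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```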

I looked briefly at the various Unicode roadmaps (
http://unicode.org/roadmaps/). Pragmatically, the two parts that concern me
are the re-encoding of CJK in SIP (plane 2) and the so-called "Large Asian
Scripts" region of SMP (plane 1).

So maybe the answer here is that CLR/JVM are right, and the upper planes are
never going to be used, so it's okay for "char" to cover BMP only. If we
need to do better later, we can either follow what CLR/JVM are doing and use
strings, or we can introduce a new type CodePoint that covers the whole
span.
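
A rough sketch of what such a CodePoint type might look like, in Java for
concreteness (the class and its methods are purely illustrative, not a
proposed design): it covers the full range U+0000..U+10FFFF, excludes the
surrogate range, and encodes to one or two UTF-16 code units as needed.

```java
// Hypothetical CodePoint type covering the whole Unicode span.
// Names here are illustrative only.
public final class CodePoint {
    private final int value;

    public CodePoint(int value) {
        // Reject values outside Unicode and the surrogate range,
        // so every CodePoint is a Unicode scalar value.
        if (value < 0 || value > 0x10FFFF
                || (value >= 0xD800 && value <= 0xDFFF))
            throw new IllegalArgumentException(
                "not a Unicode scalar value: " + value);
        this.value = value;
    }

    public int value() { return value; }

    // Encode as UTF-16: one code unit for the BMP, a surrogate pair above it.
    public char[] toUtf16() {
        return Character.toChars(value);
    }
}
```

The point of the sketch is that "char = BMP code unit" and "CodePoint =
full scalar value" can coexist without disturbing the string representation.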

The practical impact of this in the near term should be pretty minimal -
mainly that we won't be able to accept upper plane input in string or
character literals.


Do people think that is a sensible position?



Jonathan
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
