On 10 March 2010 07:02, Eric Northup <[email protected]> wrote:
> Jonathan S. Shapiro wrote:
>> On Tue, Mar 9, 2010 at 6:04 PM, Aleksi Nurmi <[email protected]> 
>> wrote:
>>
>>> 2010/3/10 Jonathan S. Shapiro <[email protected]>:
>>>
>>>> Do people think that is a sensible position?
>>>>
>>> Honestly, I don't see a lot of arguments in favor of the 16-bit char,
>>> there. :-) There's the interop thing, and well... a 16-bit char has no
>>> other use: it doesn't represent anything meaningful, it's just a
>>> uint16. To satisfy interop requirements, adding a separate type for
>>> 16-bit code units seems by far the most sensible thing to do, and I
>>> don't see any real downsides. Interoperation between BitC and CTS
>>> isn't going to be straightforward in any case.
>>>
>> Actually, that was my initial reaction, but it does have the
>> consequence that it pushes me into rebuilding the text library early.
>> That's something we need to do, but it would be nice to do it
>> incrementally.
> Not sure if this matters but there's at least one magic property of
> [MSCorlib]System.String which I think also applies to the JVM's String:
> there is a guarantee that string literals (which have type
> System.String) will be interned by the runtime and so can be compared
> via eq (and the instance method String.Intern() is also mildly but
> similarly magic).
>
> It seems to me like interoperability is a compelling reason to use the
> runtime-provided strings, appropriately wrapped and tamed.  Otherwise
> you'll end up allocating and copying strings all over the place at the
> BitC <--> {CLI, JVM} interface.
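
(For concreteness, a quick Java sketch of the interning guarantee you
describe; the class name and literals here are just illustrative:)

    public class InternDemo {
        public static void main(String[] args) {
            String a = "hello";
            String b = "hello";
            // Compile-time literals are interned, so reference equality holds.
            System.out.println(a == b);           // true

            String c = new String("hello");       // fresh, non-interned object
            System.out.println(a == c);           // false
            System.out.println(a == c.intern());  // true: the interned copy
        }
    }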

It depends on how much interoperability you want.

If you are just running on top of the runtime, you can use any string
representation equally well (and since Jonathan says he is not going to
use the native string library, all of that magic in it is moot anyway).

The other thing is interoperability with libraries, which requires
handling string conversion somehow. There are bound to be libraries and
runtimes with different string representations (at the very least POSIX
with UTF-8), so recoding support is required however you look at the
problem; no single encoding fits everywhere.
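
For example, here is a minimal Java sketch of the recoding you end up
doing at such a boundary (the "UTF-8 bytes arrive from the non-JVM
side" part is just an assumption for illustration):

    import java.nio.charset.StandardCharsets;

    public class RecodeDemo {
        public static void main(String[] args) {
            // Pretend these UTF-8 bytes arrived from the non-JVM side.
            byte[] utf8 = "na\u00efve".getBytes(StandardCharsets.UTF_8);

            // Crossing into the JVM costs a copy plus a UTF-8 -> UTF-16 transcode.
            String jvmString = new String(utf8, StandardCharsets.UTF_8);

            // Going back out costs another copy and transcode.
            byte[] back = jvmString.getBytes(StandardCharsets.UTF_8);

            System.out.println(jvmString + " round-tripped as " + back.length + " bytes");
        }
    }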

The choice of UTF-16 for Java and quite a few other platforms was made
before any characters required multiple 16-bit code units to represent,
i.e. before Unicode grew beyond the BMP. It was a rather short-sighted
decision that is upheld for compatibility's sake but has caused much
grief in the long run. UTF-16 is somewhat "optimal" for representing
CJK in that it needs one 16-bit unit per character where UTF-8 usually
needs three bytes, but that was probably not the concern when it was
chosen for Java et al., and it is quite suboptimal for ASCII and
ISO-8859-1.

I believe any runtime still using UTF-16 by now does so for
compatibility with some legacy interface; there is no sane reason to
choose UTF-16 otherwise. Not that old Java applications written with
the assumption that one short equals one character work well these
days anyway, so keeping this representation was mostly pointless.
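
The breakage is easy to demonstrate in a few lines of Java (the CJK
Extension B ideograph here is just a convenient non-BMP character):

    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+20000 lies outside the BMP, so UTF-16 needs a surrogate pair.
            String s = "\uD840\uDC00";

            System.out.println(s.length());                        // 2 code units (shorts)
            System.out.println(s.codePointCount(0, s.length()));   // 1 character
            System.out.println((int) s.charAt(0));                 // 55360, a lone high surrogate
        }
    }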

One more possibility is not to choose a single representation but to
allow multiple internal representations of strings behind a unified
high-level interface. This becomes a nightmare when strings from
multiple systems/runtimes end up in a single application, though, and
it also requires quite a bit of additional testing.
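
A hypothetical sketch of what that could look like on the JVM side (the
Text interface and both classes are made up for illustration, and the
UTF-8 variant re-decodes on every call just to keep the sketch short):

    import java.nio.charset.StandardCharsets;

    // One high-level interface, several internal encodings behind it.
    interface Text {
        int codePointLength();
        int codePointAt(int index);   // index counted in code points, not code units
    }

    // Backed by a JVM String, i.e. UTF-16 code units internally.
    final class Utf16Text implements Text {
        private final String s;
        Utf16Text(String s) { this.s = s; }
        public int codePointLength() { return s.codePointCount(0, s.length()); }
        public int codePointAt(int index) {
            return s.codePointAt(s.offsetByCodePoints(0, index));
        }
    }

    // Backed by UTF-8 bytes, e.g. text that came in through a POSIX API.
    final class Utf8Text implements Text {
        private final byte[] utf8;
        Utf8Text(byte[] utf8) { this.utf8 = utf8; }
        private String decode() { return new String(utf8, StandardCharsets.UTF_8); }
        public int codePointLength() {
            String s = decode();
            return s.codePointCount(0, s.length());
        }
        public int codePointAt(int index) {
            String s = decode();
            return s.codePointAt(s.offsetByCodePoints(0, index));
        }
    }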

Thanks

Michal
