On 10 March 2010 07:02, Eric Northup <[email protected]> wrote:
> Jonathan S. Shapiro wrote:
>> On Tue, Mar 9, 2010 at 6:04 PM, Aleksi Nurmi <[email protected]>
>> wrote:
>>
>>> 2010/3/10 Jonathan S. Shapiro <[email protected]>:
>>>
>>>> Do people think that is a sensible position?
>>>>
>>> Honestly, I don't see a lot of arguments in favor of the 16-bit char,
>>> there. :-) There's the interop thing, and well... a 16-bit char has no
>>> other use: it doesn't represent anything meaningful, it's just a
>>> uint16. To satisfy interop requirements, adding a separate type for
>>> 16-bit code units seems by far the most sensible thing to do, and I
>>> don't see any real downsides. Interoperation between BitC and CTS
>>> isn't going to be straightforward in any case.
>>>
>> Actually, that was my initial reaction, but it does have the
>> consequence that it pushes me into rebuilding the text library early.
>> That's something we need to do, but it would be nice to do it
>> incrementally.
> Not sure if this matters but there's at least one magic property of
> [MSCorlib]System.String which I think also applies to the JVM's String:
> there is a guarantee that string literals (which have type
> System.String) will be interned by the runtime and so can be compared
> via eq (and the instance method String.Intern() is also mildly but
> similarly magic).
>
> It seems to me like interoperability is a compelling reason to use the
> runtime-provided strings, appropriately wrapped and tamed. Otherwise
> you'll end up allocating and copying strings all over the place at the
> BitC <--> {CLI, JVM} interface.
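
For concreteness, the literal-interning guarantee Eric describes can be seen
with a few lines of plain Java (a minimal sketch, nothing BitC-specific; the
CLR's String.Intern behaves analogously):

// Compile-time String literals are pooled by the runtime, so reference
// equality (the "eq" comparison mentioned above) happens to hold for them;
// a string built at run time is a distinct object until it is intern()ed.
public class InternDemo {
    public static void main(String[] args) {
        String a = "hello";
        String b = "hello";                                           // same pooled literal
        String c = new StringBuilder("hel").append("lo").toString();  // built at run time

        System.out.println(a == b);           // true:  literals share one interned object
        System.out.println(a == c);           // false: c has equal contents but is a fresh object
        System.out.println(a == c.intern());  // true:  intern() returns the pooled instance
    }
}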

That said, it depends on how much interoperability you want. If you are just
running on top of the runtime, you can use any string representation equally
well (and since Jonathan says he is not going to use the native string
library, all of that magic is moot anyway). Interoperability with foreign
libraries is a different matter, because the string conversion has to be
handled somewhere. Even then, there are bound to be libraries and runtimes
with different string representations (at the very least POSIX with UTF-8),
so recoding support is required however you look at the problem; no single
encoding fits everywhere.

The choice of UTF-16 for Java and quite a few other libraries was made at a
time when characters needing more than one 16-bit code unit were not yet in
use. It was a rather short-sighted decision, kept for compatibility's sake,
and it has caused much grief in the long run. UTF-16 is somewhat "optimal"
for representing CJK text in that it needs one 16-bit unit per character
where UTF-8 usually needs three bytes, but that was probably not the concern
when it was chosen for Java et al., since it is quite suboptimal for ASCII
and ISO-8859-1. I believe any runtime still using UTF-16 today does so for
compatibility with some legacy interface; there is no sane reason to choose
it otherwise. Not that old Java applications written on the assumption that
one short equals one character work well these days, so keeping the
representation bought little (see the Java sketch in the postscript below
for both the size trade-off and the surrogate-pair breakage).

One more possibility is not to pick a single representation at all, but to
allow multiple internal representations of strings behind a unified
high-level interface (a second sketch below shows roughly what that might
look like). This becomes a nightmare when strings from several
systems/runtimes end up in one application, though, and it also requires
quite a bit of additional testing.

Thanks

Michal
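
P.S. The Java sketch mentioned above: byte counts for UTF-8 versus UTF-16,
plus the supplementary-character case that breaks the one-short-per-character
assumption. The class name and example characters are arbitrary.

import java.nio.charset.Charset;

public class StringSizes {
    public static void main(String[] args) {
        Charset utf8  = Charset.forName("UTF-8");
        Charset utf16 = Charset.forName("UTF-16BE");  // BE variant so no BOM skews the counts

        String ascii  = "hello";                // 5 ASCII characters
        String cjk    = "\u65e5\u672c\u8a9e";   // three CJK characters, all inside the BMP
        String nonBmp = "\uD835\uDD4F";         // U+1D54F, outside the BMP: one surrogate pair

        System.out.println(ascii.getBytes(utf8).length);    // 5
        System.out.println(ascii.getBytes(utf16).length);   // 10 (twice the size for plain ASCII)
        System.out.println(cjk.getBytes(utf8).length);      // 9  (3 bytes per character)
        System.out.println(cjk.getBytes(utf16).length);     // 6  (1 short per character)

        // Where the old "one short = one character" assumption breaks down:
        System.out.println(nonBmp.length());                            // 2 UTF-16 code units
        System.out.println(nonBmp.codePointCount(0, nonBmp.length()));  // 1 actual character
    }
}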

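P.P.S. And a toy sketch of the "several internal representations behind one
interface" idea, again in Java rather than BitC. The names (Text, Utf16Text,
Utf8Text) are invented for illustration, and a real library would avoid the
decode-to-String shortcut used below.

import java.nio.charset.Charset;

// One high-level interface...
interface Text {
    int codePointCount();
    byte[] toUtf8();   // recode only at a foreign-interface boundary
}

// ...backed by a native JVM/CLR-style UTF-16 string...
final class Utf16Text implements Text {
    private final String s;
    Utf16Text(String s) { this.s = s; }
    public int codePointCount() { return s.codePointCount(0, s.length()); }
    public byte[] toUtf8() { return s.getBytes(Charset.forName("UTF-8")); }
}

// ...or by bytes that arrived already encoded as UTF-8, e.g. from a POSIX API.
final class Utf8Text implements Text {
    private final byte[] bytes;
    Utf8Text(byte[] utf8Bytes) { this.bytes = utf8Bytes; }
    public int codePointCount() {
        // Shortcut for the sketch; a real implementation would walk the bytes directly.
        String decoded = new String(bytes, Charset.forName("UTF-8"));
        return decoded.codePointCount(0, decoded.length());
    }
    public byte[] toUtf8() { return bytes; }
}

The hard part, as noted above, is what happens when both kinds meet in one
program: every operation either gets written per representation or has to
recode one side first.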