So I'm looking at string encoding issues again, and concluding that it's just as icky as it was the last time I looked. I've looked at Python, and I do think they did right by declaring that I/O happens in units of bytes with conversion occurring at a layer above the I/O layer. Separately, I've concluded (reluctantly) that we really do need constant-time string indexing, and that I've been a dolt about that.
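The Python arrangement can be sketched concretely. This is just an illustration of the layering (raw bytes at the I/O boundary, decoding above it, constant-time code-point indexing on the decoded string); the byte literal here is made up for the example:

```python
# Bytes are what the I/O layer hands back; no interpretation yet.
data = b'caf\xc3\xa9'          # UTF-8 bytes, as read from a file or socket

# Conversion to text happens at a layer above I/O, with an explicit encoding.
text = data.decode('utf-8')

# Indexing operates on code points, not bytes, and is constant-time.
print(text[3])                 # the fourth code point, not the fourth byte
```

Note that `data[3]` and `text[3]` answer different questions: the former is a byte in the middle of a multi-byte sequence, the latter a whole code point. That distinction is exactly what the layering buys.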
Aside from the human convenience of naming, I do *not* think that we need to introduce a 'bytes' type in the way that Python did. I think byte[] (that is: byte vector) is sufficient for this purpose. But that leaves us with the unpleasant question of UCS32 vs. UCS16 as the normative BitC string representation.

While I don't like the space consumption, I think that UCS32 is the right answer, because it is the most flexible of the available encodings. The principal disadvantage is space. The only real solution for applications that are concerned with this is to (a) decode strings only when needed, or (b) carry uninterpreted strings around in some more compact form as instances of byte[].

The problem at that point is that we really *do* want the option to target environments like CLI and JVM, and neither of these uses UCS32 as its native string encoding. Inter-converting representations "by magic" is certainly not a good idea, and I want to avoid a proliferation of string types corresponding to each encoding.

One approach would be to introduce an opaque reference type NativeString, and a set of runtime operations that will produce NativeString from String (and the other way as well), and possibly NativeString from byte[]. The reason to make NativeString strictly opaque is error prevention. If we support indexing operations on NativeString, we invite people to write code that assumes a particular encoding of NativeString, and that code will run incorrectly (or worse: *appear* to run correctly) on other platforms.

The alternative is to introduce distinguished string types for the commonly deployed native string representations: JavaString/JavaCodeUnit and CliString/CliCodeUnit. This preserves the ability to write high-performance code for a particular target environment without abandoning error diagnosis when the code is ported. [It might be better to choose names that describe the encodings; that's a separate issue.]
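To make the NativeString idea concrete, here is a minimal sketch in Python (BitC itself would express this differently; the class name comes from the proposal above, and the choice of UTF-16 as the stand-in "native" encoding is an assumption for illustration). The essential property is that the wrapper exposes conversions but deliberately exposes no indexing, so code cannot bake in an assumption about the platform encoding:

```python
class NativeString:
    """Opaque holder for a platform-encoded string buffer.

    Deliberately defines no __getitem__/__len__: indexing a NativeString
    would invite encoding-dependent code that breaks (or silently
    misbehaves) when ported to a platform with a different native encoding.
    """
    __slots__ = ('_buf', '_encoding')

    def __init__(self, buf, encoding):
        self._buf = bytes(buf)
        self._encoding = encoding

    @classmethod
    def from_string(cls, s, encoding='utf-16-le'):
        # 'utf-16-le' stands in for whatever the target runtime uses natively.
        return cls(s.encode(encoding), encoding)

    @classmethod
    def from_bytes(cls, buf, encoding):
        # The byte[] -> NativeString operation; caller must name the encoding.
        return cls(buf, encoding)

    def to_string(self):
        # The only sanctioned way back to an indexable String.
        return self._buf.decode(self._encoding)
```

Usage is restricted to round-tripping: `NativeString.from_string(s).to_string() == s`. Any per-character work must go through `to_string()` first, which is the point.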
I resist this approach at the moment, partly because I fear a proliferation of representation-oriented types and partly because the semantics of strings in both runtime systems seem hopelessly boogered. I'm inclined to favor the NativeString approach here, but I'm open to input.

Does somebody (anybody!) see a cleaner way out here?

shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
