So I'm looking at string encoding issues again, and concluding that it's just as icky as it was the last time I looked. I've looked at Python, and I do think they did right by declaring that I/O happens in units of bytes with conversion occurring at a layer above the I/O layer. Separately, I've concluded (reluctantly) that we really do need constant-time string indexing, and that I've been a dolt about that.
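The Python arrangement can be sketched concretely. This is just an illustration of the layering (raw bytes at the I/O boundary, decoding above it, constant-time code-point indexing on the decoded string); the byte literal here is made up for the example:

```python
# Bytes are what the I/O layer hands back; no interpretation yet.
data = b'caf\xc3\xa9'          # UTF-8 bytes, as read from a file or socket

# Conversion to text happens at a layer above I/O, with an explicit encoding.
text = data.decode('utf-8')

# Indexing operates on code points, not bytes, and is constant-time.
print(text[3])                 # the fourth code point, not the fourth byte
```

Note that `data[3]` and `text[3]` answer different questions: the former is a byte in the middle of a multi-byte sequence, the latter a whole code point. That distinction is exactly what the layering buys.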
Aside from the human convenience of naming, I do *not* think that we need to introduce a 'bytes' type in the way that Python did. I think byte[] (that is: byte vector) is sufficient for this purpose. But that leaves us with the unpleasant question of UCS32 vs. UCS16 as the normative BitC string representation.

While I don't like the space consumption, I think that UCS32 is the right answer, because it is the most flexible of the available encodings. The principal disadvantage is space. The only real solution for applications that are concerned with this is to (a) decode strings only when needed, or (b) carry uninterpreted strings around in some more compact form as instances of byte[].

The problem at that point is that we really *do* want the option to target environments like CLI and JVM, and neither of these uses UCS32 as its native string encoding. Inter-converting representations "by magic" is certainly not a good idea, and I want to avoid a proliferation of string types corresponding to each encoding.

One approach would be to introduce an opaque reference type NativeString, and a set of runtime operations that will produce NativeString from String (and the other way as well), and possibly NativeString from byte[]. The reason to make NativeString strictly opaque is error prevention. If we support indexing operations on NativeString, we invite people to write code that assumes a particular encoding of NativeString, and that code will run incorrectly (or worse: *appear* to run correctly) on other platforms.

The alternative is to introduce distinguished string types for the commonly deployed native string representations: JavaString/JavaCodeUnit and CliString/CliCodeUnit. This preserves the ability to write high-performance code for a particular target environment without abandoning error diagnosis when the code is ported. [It might be better to choose names that describe the encodings; that's a separate issue.]
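To make the NativeString idea concrete, here is a minimal sketch in Python (BitC itself would express this differently; the class name comes from the proposal above, and the choice of UTF-16 as the stand-in "native" encoding is an assumption for illustration). The essential property is that the wrapper exposes conversions but deliberately exposes no indexing, so code cannot bake in an assumption about the platform encoding:

```python
class NativeString:
    """Opaque holder for a platform-encoded string buffer.

    Deliberately defines no __getitem__/__len__: indexing a NativeString
    would invite encoding-dependent code that breaks (or silently
    misbehaves) when ported to a platform with a different native encoding.
    """
    __slots__ = ('_buf', '_encoding')

    def __init__(self, buf, encoding):
        self._buf = bytes(buf)
        self._encoding = encoding

    @classmethod
    def from_string(cls, s, encoding='utf-16-le'):
        # 'utf-16-le' stands in for whatever the target runtime uses natively.
        return cls(s.encode(encoding), encoding)

    @classmethod
    def from_bytes(cls, buf, encoding):
        # The byte[] -> NativeString operation; caller must name the encoding.
        return cls(buf, encoding)

    def to_string(self):
        # The only sanctioned way back to an indexable String.
        return self._buf.decode(self._encoding)
```

Usage is restricted to round-tripping: `NativeString.from_string(s).to_string() == s`. Any per-character work must go through `to_string()` first, which is the point.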
I resist this approach at the moment, partly because I fear a proliferation of representation-oriented types and partly because the semantics of strings in both runtime systems seem hopelessly boogered. I'm inclined to favor the NativeString approach here, but I'm open to input.

Does somebody (anybody!) see a cleaner way out here?

shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
