So I'm looking at string encoding issues again, and concluding that it's just as icky as it was the last time I looked. I've looked at Python, and I do think they did right by declaring that I/O happens in units of bytes, with conversion occurring at a layer above the I/O layer. Separately, I've concluded (reluctantly) that we really do need constant-time string indexing, and that I've been a dolt about that. Unfortunately, for languages like Chinese, which introduce new characters that may be common, you can only have constant-time string indexing with UCS-32.

Aside from human convenience of naming, I do not think that we need to introduce a 'bytes' type in the way that Python did. I think byte[] (that is: byte vector) is sufficient for this purpose. But that leaves us with the unpleasant question of UCS32 vs. UCS16 as the normative BitC string representation.

I don't see the reasoning for going UCS16 at all (except for conversions to .NET). Originally the idea was that you could represent all chars in UCS16, but this is no longer true, and once you have any encoding you may as well go ASCII. IMHO this has all failed, and we would have been better off sticking with ASCII but introducing a uniform extension. It's no coincidence that UTF-8 is growing so fast and dominating where space/performance are important, e.g. HTML and XML. The fact that most documents, even Chinese ones, are riddled with ASCII means UTF-8 with extensions is still far more efficient than UTF-16 or UTF-32. UCS-2 is dead, and people who used it moved to the flexible UTF-16 because UCS-2 to UTF-8 is harder and not backward compatible.

So, back to constant-time indexing:

  ASCII/UTF-8: nope (constant-time indexing only for common languages)
  UTF-16: nope (for nearly all Western European but not Asian)
  UCS-2: yes, but it is dead and can't represent some chars (UCS-16 doesn't exist, so I'm assuming UTF-16)
  UTF-32: yes, but poor performance and excessive storage

It's a tough choice,
which is why last time constant-time indexing gave way. What is your objection to this anyway? Wouldn't constant time most of the time be good enough?

While I don't like the space consumption, I think that UCS32 is the right answer, because it is the most flexible of the available encodings. The principal disadvantage is space. The only real solution for applications that are concerned with this is to (a) decode strings only when needed, or (b) carry uninterpreted strings around in some more-compact form as instances of byte[].

But that also means any Tom, Dick and Harry micro-benchmark, such as a simple XML parser, will give poor results for BitC. Not good for a new language trying to be competitive with C.

The problem at that point is that we really *do* want the option to target environments like CLI and JVM, and neither of these uses UCS32 as their native string encoding. Inter-converting representations "by magic" is certainly not a good idea, and I want to avoid a proliferation of string types corresponding to each encoding.

CLI is a bit weird, due to the OS changing from UCS-2 to a non-fixed-length UTF-16 when everyone realized UCS-2 is not good enough. CLI uses UTF-16, which is not fixed length for some languages, i.e. in Chinese two chars in the string may represent one character. This means string.Length is not the letter count, and an index may not represent the actual letter, but it works for Western European sets.

One approach would be to introduce an opaque reference type NativeString, and a set of runtime operations that will produce NativeString from String (and the other way as well), and possibly NativeString from byte[]. The reason to make NativeString strictly opaque is error-prevention. If we support indexing operations on NativeString, we invite people to write code that assumes a particular encoding of NativeString, and that code will run incorrectly (or worse: appear to run correctly) on other platforms.
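To make the opaque NativeString idea concrete, here is a rough sketch in Python rather than BitC. Everything in it is an assumption for illustration: the method names (from_string, to_string, from_bytes) are hypothetical, and the platform encoding is hard-wired to UTF-16 as a stand-in for a CLI- or JVM-like host. It is not a proposed BitC API.

```python
class NativeString:
    """Hypothetical opaque handle for a string in the host platform's encoding."""

    __slots__ = ("_payload",)

    # Assumption: the host (e.g. CLI or JVM) stores strings as UTF-16.
    _PLATFORM_ENCODING = "utf-16-le"

    def __init__(self, payload: bytes):
        # The payload is private; callers never see the code units.
        self._payload = payload

    @classmethod
    def from_string(cls, s: str) -> "NativeString":
        """Explicit conversion String -> NativeString."""
        return cls(s.encode(cls._PLATFORM_ENCODING))

    @classmethod
    def from_bytes(cls, raw: bytes, encoding: str) -> "NativeString":
        """Explicit conversion byte[] -> NativeString; the caller names the encoding."""
        return cls.from_string(raw.decode(encoding))

    def to_string(self) -> str:
        """Explicit conversion NativeString -> String."""
        return self._payload.decode(self._PLATFORM_ENCODING)

    def __getitem__(self, index):
        # Deliberately unsupported, so no code can depend on the encoding.
        raise TypeError("NativeString is opaque; call to_string() to index")
```

With this shape, the only way to index is to convert first, for example NativeString.from_string(s).to_string()[0], so a program cannot quietly bake in an assumption about the platform's code units.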
I proposed something similar last time, i.e. a fast native string and a more featured tree string, but the issues were:

1) Library issues with two string types
2) You can have a single string type and hide the internal representation

Why not leave the internal representation open?

Ben
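For anyone who wants numbers behind the encoding trade-offs argued over in this thread, here is a small Python sketch. The function name encoding_profile and the sample strings are made up for illustration; the codec names are the standard Python ones. It counts code points, UTF-8 bytes, UTF-16 code units, and UTF-32 bytes for the same text.

```python
def encoding_profile(text: str) -> dict:
    """Return size measurements of `text` under the three Unicode encodings."""
    return {
        # Python's len() on str counts code points.
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; this is what a UTF-16 length reports.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf32_bytes": len(text.encode("utf-32-le")),
    }


samples = {
    "ascii": "hello",
    "bmp_cjk": "\u4e2d\u6587",      # two common Chinese characters inside the BMP
    "non_bmp_cjk": "\U00020000",    # a CJK Extension B character outside the BMP
}

for name, text in samples.items():
    print(name, encoding_profile(text))
```

The common Chinese characters in the second sample each fit in one UTF-16 code unit; it is the supplementary character in the third sample that takes two code units for one code point, which is the case where a UTF-16 length and the character count disagree.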
_______________________________________________
bitc-dev mailing list
http://www.coyotos.org/mailman/listinfo/bitc-dev
