On Tue, Mar 15, 2011 at 5:24 PM, Ben Kloosterman <[email protected]> wrote:
> > > > > So I'm looking at string encoding issues again, and concluding that it's > just as icky as it was the last time I looked. I've looked at Python, and I > do think they did right by declaring that I/O happens in units of bytes with > conversion occurring at a layer above the I/O layer. Separately, I've > concluded (reluctantly) that we really do need constant-time string > indexing, and that I've been a dolt about that. > > > > > > Unfortunately for languages like Chinese which introduce new characters > that may be common , you can only have constant-time string indexing with > UCS-32. > Agreed. I don’t see the reasoning for going UCS16 at all ( except for conversions > to .NET ) > Sorry. I keep writing UCS16 when I mean UCS2. I trip over the fact that UCS and UTF use different size conventions. .NET uses UTF16 encoding, but implements strings as vectors of UCS2, with the consequence that an indexing operation can return either part of a code unit pair. They compound the mess by equating the System.char type to UCS2. Nothing wrong with a UCS2 type, but IMO it should not have been misnamed "char". I'm *still* puzzled by what they were thinking, since the conclusion that UCS2 wasn't big enough was evident when they made this decision. The current theory of operation seems to be that all handling of the upper code points is done with strings (i.e. never with chars). So back the constant time indexing > > ASII/ UTF-8 nope ( constant time indexing only for common languages) > > UTF-16- nope ( for nearly all western European but not asian ) > > UCS-2 yes – but is dead and cant represent some chars. (UCS-16 doesn’t > exist so im assuming UTF-16) > > UTF-32 yes but poor performance and excessive storage. > > > > It’s a tough choice.. which is why last time constant time indexing gave > way . What is your objection to this anyway wouldn’t constant time most of > the time be good enough ? > The problem is that you don't actually *get* constant time most of the time. If strings are defined as UTF16-encoded UCS4, then you must assume that a UCS4 code point may be present, and scan accordingly. Further, you end up having to declare that a "char" is UCS2, which is pretty clearly busted. Even if you set a bit in the string header to indicate that the string is not "contaminated" by asian influences (BIG grin there), you still have to *test* that bit on every indexing operation that gets compiled. Now to answer your last question, I think my real concern here is to get O(1) time iteration through a string, which isn't the same thing as O(1) random access at all. But that appears to require building some sort of "string iterator" object whose use is known to the code generator. That cure seems potentially worse than the disease. > While I don't like the space consumption, I think that UCS32 is the right > answer, because it is the most flexible of the available encodings. The > principle disadvantage is space. The only real solution for applications > that are concerned with this is to (a) decode strings only when needed, or > (b) carry uninterpreted strings around in some more-compact form as > instances of byte[]. > > > > But that also means any tom dick and harry micro benchmark such as a simple > xml parser will give poor results for BitC. Not good for a new language > trying to be competitive with C > Your concern about microbenchmarks is, regrettably, something we cannot ignore. At the same time, the problem is fundamental. We have to make *some*choice. If I adopt the "strings are UCS4" position, then the right way to write a simple xml parser is by using vector (or array) of byte. Which is actually the right way to do it in any case. If I adopt the "strings are UTF16 on UCS2" position, then the xml parser should *still* be written using bytes; all we've done is lose the indexing convenience on strings. If I adopt the "strings are UTF8 on UCS1" position, then the XML parser *still* has to be written with byte vectors, because otherwise there is no way to index the individual code units for decoding purposes. Am I missing something here? CLI is a bit weird , due to the OS changing from UCS2 to a non fixed length > encoded UTF-16 – when everyone realized USC-2 is not good enough . CLI > uses UTF-16 which is not fixed length for some languages ie in Chinese 2 > chars in the string may represent one. Which means string.Length is not > the letter count , and an index max not represent the actual letter but > it works for western European sets. > Yes. Java, as I understand matters, came to a similar outcome, but managed not to use UTF16 somehow - I'm not clear on the details of that, and I may very well be mistaken. > One approach would be to introduce an opaque reference type NativeString, > and a set of runtime operations that will produce NativeString from String > (and the other way as well), and possibly NativeString from byte[]. The > reason to make NativeString strictly opaque is error-prevention. If we > support indexing operations on NativeString, we invite people to write code > that assumes a particular encoding of NativeString, and that code will run > incorrectly (or worse: *appear* to run correctly) on other platforms. > > > > I proposed something similar last time ie a fast native string and a more > featured tree string but the issues were > > 1) Library issues for 2 string types > > 2) You can have a single string type and hide the internal > representation . > So I'm now convinced that [2] turns out not to be sensible. [1] remains a concern, which is why I am suggesting that NativeString should be an *opaque * type. > Why not leave the internal representation open ? > Because the code generator needs to know how to emit indexing operations, and because I'm presently getting ready to write a lexer and a string intern subsystem in BitC and I'ld sorta like to know what the performance is likely to look like. shap
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
