Shap wrote:
> So the conclusion on all this appears to be that there simply isn't a good choice of string type, which I think we already knew.
In scenarios like this it's best to go lightweight, offer maximum flexibility, and leave it to the standard library (or a competing standard library) to sort out.

> William asks, correctly: what are my objectives for strings? I have three:
>
> 0. The type string is the type assigned to string literals.
> 1. String indexing should return a code point.
> 2. String traversal (get next character) should be O(1).
> 3. String fetch code generation should not require a test and conditional branch.
>
> After all of our discussion, I'm sorely tempted to just delete strings from the type system altogether on the grounds that they are always the wrong thing to use. Unfortunately we cannot do that because of condition [0].

Far too radical for me, as strings help developers write coherent libraries; though if you're saying not to have it in the type system, but instead to put, say, a string type class in the standard library, then I do agree.

> But if a string is always the wrong thing to use, then it seems to me that at least it should produce naively sensible results. This seems to lead to two sensible outcomes:
>
> Definition 1: A string is a representation paired with an encoding as fixed-length units. It is left to the programmer to perform any encoding conversion(s) required to obtain the indexing and sequencing properties desired. The job of the library is to support those conversions.
>
> Definition 2: A string is a thing that can be used by a naive programmer to generate unsurprising outcomes for small to medium-sized inputs. This would argue for UCS-4.
>
> I'm sure this is badly broken somehow, but it's at least consistent and explicable.
>
> So over to Ben, who will explain why I'm still being incredibly dense (and thank God someone here has a handle on this stuff).

I don't really have a full handle on this either; it's not easy. Look at .NET: at the time, UCS-2 was all the rage, and as a result they are stuck with UTF-16, meaning all C# apps convert from UTF-16 to UTF-8 for web pages, XML, you name it.
Yes, I prefer definition 1, but if the type defines the encoding I see no reason for it to be fixed-index at the higher level (though obviously it is at the lower level).

Shap wrote:
> Now to answer your last question, I think my real concern here is to get O(1) time iteration through a string, which isn't the same thing as O(1) random access at all. But that appears to require building some sort of "string iterator" object whose use is known to the code generator. That cure seems potentially worse than the disease.

A quick reply: maybe we are taking the wrong approach here. Let's look at how and when string indexing is actually used. The compiler parsing/lexing you are probably looking at is not typical string code, and I would argue such code may be better handled by a specialist library rather than by default string handling.

In most cases, code works with very small strings and simply concatenates them (especially if you go with the stream-type output you discussed earlier, though a C/C# format string with varargs can be stored as separate strings and concatenated to produce a single large string). Actually, most developer indexing of strings should be avoided, as it is nearly always buggy, hard-to-maintain code, and in most cases it has been replaced by replace-style functions returning a new string. Since the allocation cost is relatively high compared to a non-O(1) index, I don't see the issue here. (Note that in C#, anyway, replace and index operations are nearly always on a substring, not a char.)

Regarding O(1) iteration through the string: in most cases this is not needed, as you simply treat it like we did in C, i.e. bytes with a higher encoding. You certainly don't need a string iterator; that would be too expensive, and type classes should ensure the right function is called, though in the worst case a conditional would not be bad.
Also, a scheme like UTF-8 is specifically designed so you can check for multi-byte characters, and you can use SSE3 instructions to scan 16, or now 32 (YMM), bytes in one instruction; if you find no multi-byte sequences you can treat the string as O(1)-indexable. Now, if we look at conversion costs, you can convert from UTF-8 to UTF-16 or UCS-4 very cheaply for the most common characters (again, SSE3 with YMM registers can convert 8 or 16 characters in one operation, covering most strings). This means a special high-performance library which uses lots of indexing may convert to a more appropriate type. The fact that most original documents are UTF-8, and are parsed a line at a time by such libraries, suggests it may not be bad to use UTF-8 internally (since C, C# and Java always pay the conversion cost). Also, for conversion in .NET a lot of this work is done on char arrays, and ToCharArray() should probably always return UCS-4.

A lot of today's high-performance string-parsing libraries with indexes have issues with Asian characters. I used one as a user (it was PHP or Python at the back end) and it failed for some Chinese characters, resulting in a mess; but there may be a case for using UCS-2 where you know you won't have such Asian characters, as it would be faster.

Anyway, summing up, I prefer this: a string is an indexable collection of characters with an encoding. The default IMHO should be UTF-8, which gives fast reading and writing, fast concatenation, and reasonably fast replace. Indexes are supported but may be O(n), though libraries will use the nearest known character position to reduce the impact (which is done anyway in most libraries). For cases with lots of indexing or mutable work, conversion to UCS-2 or UTF-16 (for Asian scripts) should be used, maybe based on internationalization settings. Note this conversion is free, since it is done in C, C# and Java anyway, and it's a good idea for the standard library NOT to work with mutable strings anyway.
A special library supporting fast character processing on mutable strings can just check what format the string is in and convert it if necessary; this is just an expanded version of a StringBuilder, which is needed anyway to build large strings efficiently. I am well aware that no common language has made this choice, but most of those languages originated when UCS-2 was the grand solution, or are tied to runtimes that did. Furthermore, immutable strings push the argument toward less indexing. UTF-8's growth has been recent and driven by practical experience; before that, everyone was saying UTF-16 would rule. Note that ASCII, with the .NET-style scheme of handling multi-byte characters in specialist libraries for those cultures, was the way things were done in C before, when performance was more of a premium. I.e., as you said, at the lower level a string is a set of bytes with a length and an encoding, and string searches belong at a higher level with internationalization etc.

Consider cultural issues, e.g. Turkish: there exists a capital "I with a dot" character (\u0130), which is the capital version of i. Similarly, in Turkish there is a lowercase "i without a dot" (\u0131), which capitalizes to I. This requires more complex searching and indexing.

I wouldn't worry about performance in Chinese; they are already blessed with words that are half our size, and they suffer a lot from expensive ASCII in UCS-4. UCS-4 is not popular there; most use a custom encoding on top of ASCII. (In Chinese, anyway, index lookups are nearly always string searches, very rarely char lookups.)

Even if this scheme is wrong in the long term (the risk is conversion being too frequent, but if a string is converted fewer than about twice you will probably still be better off with UTF-8), you will do great in benchmarks, which will help the language's adoption, and the default could later be changed to UCS-2/UTF-32 depending on internationalization settings. The fact that we had higher encodings on top of ASCII before (especially in Asia), and it worked well, supports this.
Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
