Just reading about strings in BitC.
There is a compromise position, which is where we are currently leaning:

* A well-formed string consists of a sequence of code points. The specification does not take a position on the encoding of strings in the heap.
* Strings support indexing on both UCS-2 and UCS-4 code units.
* Any operation that accepts code units and produces a string is obliged to confirm that the code unit sequence constitutes a well-formed code point sequence, to ensure that multiple indexing schemes are possible.
* Implementations are encouraged where possible to use a run-encoded internal representation of strings incorporating a hidden cached cursor, such that arbitrary indexing and sequential indexing are both implemented in O(1) time. A reference implementation for such an encoding will eventually be provided by the BitC implementation.

Is this wise? UTF-8 content is ubiquitous. I was challenged on this recently, the claim being that UTF-16 (UCS-2) would be smaller for foreign-language sites. It turned out that very few sites used UTF-16, and even when they did, UTF-8 was significantly smaller despite its variable-length encoding, mainly due to the huge amount of ASCII content in XML and HTML files, and the fact that it takes no more space for common characters, even Asian and Sanskrit ones. This is pretty significant, as it means you would have to convert nearly all HTML and XML from UTF-8 to UTF-16 or UTF-32.

When I started with C# after coming from C, I wondered how strings could work well without indexers; it is kind of a shock at first, but after many years it works quite well. Strings are immutable (which means they get put in a special region of the GC that doesn't need to be remarked), which is also nice for multi-threaded work, and if the highest performance is needed you can work with a mutable char array (as in C) and convert to and from strings.
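Going back to the cached-cursor bullet in the quoted spec: the idea can be sketched in a few lines. This is a toy Python illustration under my own assumptions, not BitC's actual representation; the `CursorString` class and its fields are made up. The string stores UTF-8 bytes internally and remembers the last (code point index, byte offset) pair, so that sequential indexing is O(1) amortized even though the encoding is variable-length.

```python
# Hypothetical sketch (not BitC's implementation): a UTF-8 string with a
# hidden cached cursor so sequential indexing is O(1) amortized.
class CursorString:
    def __init__(self, text):
        self._bytes = text.encode("utf-8")   # internal UTF-8 storage
        self._cursor = (0, 0)                # (code point index, byte offset)

    @staticmethod
    def _char_len(first_byte):
        # Number of bytes in the UTF-8 sequence starting with first_byte
        # (only ever called at a code point boundary).
        if first_byte < 0x80: return 1
        if first_byte < 0xE0: return 2
        if first_byte < 0xF0: return 3
        return 4

    def __getitem__(self, i):
        idx, off = self._cursor
        if i < idx:                          # cursor is past the target: restart
            idx, off = 0, 0
        while idx < i:                       # walk forward from the cursor
            off += self._char_len(self._bytes[off])
            idx += 1
        self._cursor = (idx, off)            # remember where we got to
        end = off + self._char_len(self._bytes[off])
        return self._bytes[off:end].decode("utf-8")

s = CursorString("héllo")
assert s[1] == "é" and s[4] == "o"           # s[4] reuses the cursor from s[1]
```

A real run-encoded implementation would presumably also record runs of fixed-width characters so that arbitrary indexing is O(1) as well; this sketch only shows the cursor half of the idea.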
Now, .NET uses UTF-16 (UCS-2), since there was no UTF-8 when it was designed. This gives .NET quite a hefty penalty in string work (especially HTML and XML) compared to UTF-8 parsers, as it has to process almost twice the data and convert from UTF-8 to UTF-16. I think that if .NET were UTF-8 it would be significantly faster at string handling, though it is quite fast already when you consider that strings are heap-based objects.

Using strings with a non-indexable internal UTF-8 representation, plus an easily indexable char[] of ASCII, UTF-16, or UTF-32, will require conversion, but in my experience these conversions are rare; in most cases you just deal with strings or with the char[]. At the very least your string data would use half the memory (or a quarter of UTF-32!), which is nothing to sneeze at, since string data makes up a large fraction of program data, especially on embedded systems. I do note that such a string class works best with a stack-like nursery allocator, due to fast creation, but there is no reason strings couldn't be structs.

The library itself could, and probably would, index the private UTF-8 data of the string, with the indexes being byte offsets. While direct arithmetic on such an index is meaningless, most operations these days tend to be matching and searching, which can compute the byte offsets at the same time.

Ben
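P.S. A quick illustration of the two claims above, the size comparison and searching directly on the UTF-8 bytes. The markup sample here is made up for illustration; the point is that ASCII-heavy markup keeps UTF-8 at or below the UTF-16 size even with CJK content, and that a search on the raw bytes yields exactly the byte offset a UTF-8-internal string library would use as its index.

```python
# Made-up sample: ASCII markup around CJK text.
sample = '<p class="note">温度は25°Cです</p>'

utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")   # no BOM, i.e. raw 2-byte code units

# The ASCII-heavy markup makes the UTF-8 form smaller overall;
# for pure-ASCII HTML/XML it would be exactly half the UTF-16 size.
assert len(utf8) < len(utf16)

# Searching works directly on the UTF-8 bytes: find() returns a byte
# offset, which is the index the library would use internally, with no
# conversion to UTF-16 or UTF-32 needed.
needle = "25°C".encode("utf-8")
off = utf8.find(needle)
assert utf8[off:off + len(needle)].decode("utf-8") == "25°C"
```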
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
