Higher level built-in strings

bearophile Sun, 18 Jul 2010 18:50:22 -0700

This odd post comes from reading the nice part about strings of chapter 4 of 
TDPL. In the last few years I have seen changes in how D strings are meant and 
managed, changes that make them less and less like arrays (random-access 
sequences of mutable code units) and more and more what they are at high level 
(immutable bidirectional sequences of code points).


So a final jump is to make string types something different from normal arrays. 
This frees them to behave better as high level strings. Probably other people 
have had similar ideas, so if you think this post is boring or useless, please 
ignore it.

So strings can have some differences compared to arrays:

1) length returns code points, but it's immutable and stored in the string, so 
it's an O(1) operation (returns what std.utf.count() returns).

2) Assigning a string to another one is allowed:
string s1 = s2;
But changing the length manually is not allowed:
s1.length += 2; // error
This makes them more close to immutable, more similar to Python strings.

3) Another immutable string-specific attribute can be added that returns the 
number of code units in the string, for example .codeunits or .nunits or 
something similar.

4) s1[i] is not O(1), it's generally slower, and returns the i-th code point 
(there are ways to speed this operation up to O(log n) with some internal 
indexes). (Code points in dstrings can be accessed in O(1)).

5) foreach(c; s1) yields a code point dchars regardless if s1 is string, 
dstring, wstring.
But you can use foreach(char c; s1) if s1 is a string and you are sure s1 uses 
only 7 bit chars. But in such cases you can also use a immutable(ubyte)[]. 
Python3 does something similar, its has a str type that's always UTF16 (or 
UTF32) and a bytes type that is similar to a ubyte[]. I think D can do 
something similar. std.string functions can made to work on ubyte[] and 
immutable(ubyte)[] too.


Some more things, I am not sure about them:

6) Strings can contain their hash value. This field is initialized with a empty 
value (like -1) and computed lazily on-demand. (as done in Python). This can 
make them faster when they often are put inside associative arrays and sets, 
avoiding to compute their hash value again and again. So strings are not fully 
immutable, because this value gets initialized. But it's a pure value, it's 
determined only by the immutable contents of the string, so I don't think this 
can cause big problems in multi-thread programs. If two threads update it, they 
find the same result.

7) the validation (std.utf.validat(), or a cast) can be necessary to create a 
string. This means that the type string/dstring/wstring implies it's validated 
:-)

8) If strings are immutable, then the append can always create a new string. So 
the extra information at the end of the memory block (used for appending to 
dynamic arrays) is not necessary.

9)
- Today memory, even RAM, is cheap, but moving RAM to the CPU caches is not so 
fast, so a program that works on just the cache is faster.
- UFT8 and UTF16 make strings bidirectional ranges.
- UTF encodings are just data compression, but it's not so efficient.
So a smarter compression scheme can compress strings more in memory, and the 
decrease in cache misses can compensate for the increased CPU work to 
decompress them. (But keeping strings compressed can turn them from 
bidirectional ranges to forward ranges, this is not so bad).
There is a compressor that gives a decompression speed of about 3 times slower 
than memcpy():
http://www.oberhumer.com/opensource/lzo/
LZO can be used transparently to compress strings in RAM when strings become 
long enough.
Hash computation and equality tests done among compressed strings are faster.


Turning strings into something different from arrays looks like a loss, but in 
practice they already are not arrays, thinking about them as normal arrays is 
no so useful and it's UTF-unsafe, using [] to read the code units doesn't seem 
so useful. Code that manages strings/wstrings as normal arrays is not correct 
in general.

Bye,
bearophile

Higher level built-in strings

Reply via email to