bearophile wrote:
Walter Bright:
The problem with dchars is that strings of them consume memory at a prodigious
rate.

Warning: lazy musings ahead.

I hope we'll soon have computers with 200+ GB of RAM, where using strings of
anything smaller than 32-bit chars is in most cases a premature optimization
(much as it is usually a silly optimization today to use arrays of 16-bit ints
instead of 32-bit or 64-bit ones; only special cases found with a profiler
justify arrays of shorts in a low-level language).

Even in a PC with 200 GB of RAM, the first levels of CPU cache can be very
small (around 32 KB), and cache misses are costly, so even with huge amounts
of RAM it can still be useful to reduce the size of strings to improve
performance.

A possible solution to this problem is some kind of real-time hardware
compression/decompression between the CPU and the RAM. UTF-8 is a good enough
way to compress 32-bit strings, but then we are back to writing low-level
programs that have to deal with UTF-8.

To avoid this, the CPU and RAM could compress/decompress the text transparently
to the programmer. Unfortunately UTF-8 is a variable-length encoding, so maybe
it can't be handled transparently enough; a smarter compression algorithm could
be used to keep all this transparent enough (not fully transparent: some
low-level situations would still require code that deals with the compression).

I strongly suspect that the encode/decode time for UTF-8 is more than compensated for by the 4x reduction in memory usage. I did a large app 10 years ago using dchars throughout, and the effects of the memory consumption were murderous.

(As the recent article on memory consumption shows, large data structures can carry huge speed penalties due to virtual memory, cache pressure, and multiple cores contending for the same memory.)

https://lwn.net/Articles/250967/
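
To put rough numbers on that 4x figure, here is a minimal sketch (plain Phobos, nothing specific to the app I mentioned) that prints how much storage the same mostly-ASCII text needs in each of D's three string types:

import std.conv : to;
import std.stdio : writefln;

void main()
{
    // The same mostly-ASCII text in the three encodings D supports.
    string  s8  = "The quick brown fox jumps over the naïve lazy dog";
    wstring s16 = s8.to!wstring;   // UTF-16 code units (wchar, 2 bytes each)
    dstring s32 = s8.to!dstring;   // UTF-32 code points (dchar, 4 bytes each)

    writefln("UTF-8 : %s bytes", s8.length  * char.sizeof);
    writefln("UTF-16: %s bytes", s16.length * wchar.sizeof);
    writefln("UTF-32: %s bytes", s32.length * dchar.sizeof);
}

For pure ASCII the UTF-32 figure is exactly four times the UTF-8 one; the occasional non-ASCII character barely moves it.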

Keep in mind that the overwhelming bulk of UTF-8 text is ASCII, and requires only one cycle to "decode".
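
That fast path is easy to sketch. Something like the following (countCodePoints is just an illustrative name, and it leans on Phobos' std.utf.decode rather than any compiler internals) only does real work for non-ASCII bytes:

import std.utf : decode;

// Count code points in a UTF-8 string, taking the trivial path for ASCII bytes.
size_t countCodePoints(string s)
{
    size_t n;
    for (size_t i = 0; i < s.length; )
    {
        if (s[i] < 0x80)
            ++i;            // ASCII: the byte is the code point, nothing to decode
        else
            decode(s, i);   // multi-byte sequence: decode() advances i past it
        ++n;
    }
    return n;
}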
