On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via
> Digitalmars-d wrote: [...]
>> - the auto-synchronization and the statelessness are big deals.
> Yes. Imagine if we standardized on a header-based string
> encoding, and we wanted to implement a substring function over
> a string that contains multiple segments of different
> languages. Instead of a cheap slicing over the string, you'd
> need to scan the string or otherwise keep track of which
> segment the start/end of the substring lies in, allocate memory
> to insert headers so that the segments are properly
> interpreted, etc. It would be an implementation nightmare,
> and an unavoidable performance hit (you'd have to copy data
> every time you take a substring), and the @nogc guys would be
> up in arms.
You'd have three data structures: Strand, Rope, and Slice.
A Strand is a series of bytes with an encoding. A Rope is a
series of Strands. A Slice is a pair of location references
within a Rope. You probably want a special data structure to name
a location within a Rope: Strand index, then byte offset. Total
of five words instead of two to pass a Slice, but zero dynamic
allocations.
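A sketch of what that layout might look like (in Rust for concreteness; the field types and helper names are my guesses, not a worked-out design):

```rust
/// A run of bytes in a single (hypothetical) encoding.
struct Strand {
    encoding: u16, // assumed 2-byte encoding identifier
    bytes: Vec<u8>,
}

/// A series of Strands.
struct Rope {
    strands: Vec<Strand>,
}

/// A location inside a Rope: Strand index, then byte offset.
#[derive(Clone, Copy)]
struct Loc {
    strand: usize,
    byte: usize,
}

/// A pair of locations within a Rope: five words in total
/// (one rope pointer plus two two-word Locs), zero allocations.
struct Slice<'a> {
    rope: &'a Rope,
    start: Loc,
    end: Loc,
}

impl Rope {
    /// Take a sub-slice without copying or scanning any strand data.
    fn slice(&self, start: Loc, end: Loc) -> Slice<'_> {
        Slice { rope: self, start, end }
    }
}

impl Slice<'_> {
    /// Total number of bytes the slice covers.
    fn len(&self) -> usize {
        if self.start.strand == self.end.strand {
            return self.end.byte - self.start.byte;
        }
        // Tail of the first strand, all middle strands, head of the last.
        let mut n = self.rope.strands[self.start.strand].bytes.len() - self.start.byte;
        for s in &self.rope.strands[self.start.strand + 1..self.end.strand] {
            n += s.bytes.len();
        }
        n + self.end.byte
    }
}

fn main() {
    let rope = Rope {
        strands: vec![
            Strand { encoding: 1, bytes: b"hello ".to_vec() },
            Strand { encoding: 2, bytes: b"world".to_vec() },
        ],
    };
    // Slice from byte 2 of strand 0 to byte 3 of strand 1: "llo wor".
    let s = rope.slice(Loc { strand: 0, byte: 2 }, Loc { strand: 1, byte: 3 });
    println!("slice covers {} bytes", s.len()); // 4 + 3 = 7
}
```

Taking a substring is just constructing a new Slice, so it stays O(1) and allocation-free; the cost shows up instead in the two-part Loc and the extra indirection on every byte access.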
This would be a problem for data locality. However, rope-style
data structures are handy for some types of string manipulation.
As an alternative, you might have a separate document specifying
what encodings apply to what byte ranges. Slices would then be
three words long (pointer to the string struct, start offset, end
offset). Iterating would cost O(log(S) + M): a binary search to
find the starting segment, then a linear walk, where S is the
number of encoded segments and M is the number of bytes in the
slice.
Anyway, you end up with either a more complex data structure or
terrible time complexity, but at least you don't suffer both at
once.
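The side-table variant can be sketched like this (again in Rust; the table layout and names are illustrative assumptions, not a proposal):

```rust
/// One entry in the side table: where a segment starts, and its encoding.
struct Segment {
    start: usize,  // byte offset where this segment begins
    encoding: u16, // assumed 2-byte encoding identifier
}

struct EncodedString {
    bytes: Vec<u8>,
    segments: Vec<Segment>, // sorted by `start`; first entry has start == 0
}

impl EncodedString {
    /// Find which segment contains byte offset `pos`: O(log S).
    fn segment_at(&self, pos: usize) -> usize {
        match self.segments.binary_search_by(|s| s.start.cmp(&pos)) {
            Ok(i) => i,
            Err(i) => i - 1, // pos falls inside the previous segment
        }
    }
}

/// A slice is three words: pointer to the string, start, end.
struct Slice<'a> {
    s: &'a EncodedString,
    start: usize,
    end: usize,
}

impl Slice<'_> {
    /// Visit each byte with its encoding: one O(log S) lookup up front,
    /// then a linear walk -- O(log S + M) overall.
    fn for_each_byte(&self, mut f: impl FnMut(u8, u16)) {
        let mut seg = self.s.segment_at(self.start);
        for pos in self.start..self.end {
            // Advance past any segment boundaries we cross.
            while seg + 1 < self.s.segments.len()
                && self.s.segments[seg + 1].start <= pos
            {
                seg += 1;
            }
            f(self.s.bytes[pos], self.s.segments[seg].encoding);
        }
    }
}

fn main() {
    let es = EncodedString {
        bytes: b"abcXYZ".to_vec(),
        segments: vec![
            Segment { start: 0, encoding: 1 },
            Segment { start: 3, encoding: 2 },
        ],
    };
    let sl = Slice { s: &es, start: 1, end: 5 };
    sl.for_each_byte(|b, enc| println!("{:?} in encoding {}", b as char, enc));
}
```

Slicing stays O(1) here too; the price is that every iteration has to carry the segment cursor along, and the side table itself is a second, separately-sized allocation.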
> And that's assuming we have a sane header-based encoding for
> strings that contain segments in multiple languages in the
> first place. Linguistic analysis articles, for example, would
> easily contain many such segments within a paragraph, or
> perhaps in the same sentence. How would a header-based encoding
> work for such documents? Never mind the recent trend of
> liberally sprinkling emojis all over regular text. If every
> emoticon embedded in a string requires splitting the string
> into 3 segments complete with their own headers, I dare not
> imagine what the code that manipulates such strings would look
> like.
"Header" implies that all encoding data appears at the start of
the document, or in a separate metadata segment. (Call it a start
index and two bytes to specify the encoding; reserve the first
few bits of the encoding to specify the width.) It also brings to
mind HTTP, and reminds me that most documents are either mostly
ASCII or a heavy mix of ASCII and something else (HTML and XML
being the forerunners).
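The "reserve the first few bits for the width" idea might pack like this (a guess at one plausible layout of the 16-bit encoding field, nothing more):

```rust
/// Number of bits at the top of the encoding field reserved for the
/// code-unit width. Two bits cover widths of 1, 2, and 4 bytes.
const WIDTH_BITS: u32 = 2;

/// Pack a code-unit width and an encoding id into the 16-bit field:
/// top two bits for the width, remaining fourteen for the id.
fn pack_encoding(width: u8, id: u16) -> u16 {
    let w: u16 = match width {
        1 => 0,
        2 => 1,
        4 => 2,
        _ => panic!("unsupported width"),
    };
    (w << (16 - WIDTH_BITS)) | (id & 0x3FFF)
}

/// Recover the code-unit width from the top two bits.
fn unpack_width(field: u16) -> u8 {
    match field >> (16 - WIDTH_BITS) {
        0 => 1,
        1 => 2,
        _ => 4,
    }
}

fn main() {
    let field = pack_encoding(2, 0x0123);
    println!("field = {:#06x}, width = {}", field, unpack_width(field));
}
```

Fourteen bits still leaves room for sixteen thousand distinct encodings per width, which is far more than any real registry would need.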
If the encoding succeeded at making most scripts single-byte,
then, testing with https://ar.wikipedia.org/wiki/Main_Page, you
might get within 15% of UTF-8's efficiency. And then a simple
sentence like "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?" is 2.64 times as
long in this encoding as UTF-8, since it has ten encoded
segments, each with overhead. (Assuming the header supports
strings up to 2^32 bytes long.)
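For the curious, here is one way that arithmetic could go, assuming each segment header is a 4-byte start index (the 2^32 limit) plus the 2-byte encoding field; that assumption lands close to the 2.64 figure, and the exact ratio depends on the header layout:

```rust
fn main() {
    let sentence = "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?";
    let utf8_len = sentence.len();          // 34 bytes in UTF-8
    let payload = sentence.chars().count(); // 29 bytes if every script is single-byte
    let segments = 10;                      // runs alternate between ASCII and Latin-extended
    let header = 4 + 2;                     // 4-byte start index + 2-byte encoding field (assumed)
    let total = payload + segments * header;
    println!(
        "UTF-8: {} bytes, header-based: {} bytes, ratio {:.2}",
        utf8_len, total, total as f64 / utf8_len as f64
    );
}
```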
If it didn't succeed at making Latin and Arabic single-byte
(and Latin has over 800 characters in Unicode, while Arabic has
over three hundred), it would be worse than UTF-16.