On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
- the auto-synchronization and the statelessness are big deals.

Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments in different languages. Instead of taking a cheap slice of the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc. It would be an implementation nightmare and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the @nogc guys would be up in arms.

You'd have three data structures: Strand, Rope, and Slice.

A Strand is a series of bytes with an encoding. A Rope is a series of Strands. A Slice is a pair of location references within a Rope. You probably want a dedicated data structure to name a location within a Rope: a strand index, then a byte offset within that strand. That's five words instead of two to pass a Slice, but zero dynamic allocations.
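A minimal sketch of those three structures (the names follow the post; the field choices and the `materialize` helper are my own illustration, not a proposal):

```python
from dataclasses import dataclass

@dataclass
class Strand:
    """A run of bytes in a single encoding."""
    encoding: str   # e.g. "ascii", "latin-1"
    data: bytes

@dataclass
class RopePos:
    """A location inside a Rope: strand index, then byte offset."""
    strand: int
    offset: int

@dataclass
class Rope:
    """A series of Strands."""
    strands: list

@dataclass
class Slice:
    """Two positions into a Rope: five words total, zero allocations."""
    rope: Rope
    start: RopePos
    end: RopePos

    def materialize(self) -> str:
        """Decode the covered range (copying only happens here,
        when you leave slice-land, never when taking the slice)."""
        out = []
        for i in range(self.start.strand, self.end.strand + 1):
            s = self.rope.strands[i]
            lo = self.start.offset if i == self.start.strand else 0
            hi = self.end.offset if i == self.end.strand else len(s.data)
            out.append(s.data[lo:hi].decode(s.encoding))
        return "".join(out)
```

Taking a sub-slice of a Slice is then just arithmetic on two RopePos values, which is the whole point.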

This hurts data locality. However, rope-style data structures are handy for some kinds of string manipulation.

As an alternative, you might keep a separate side table specifying which encodings apply to which byte ranges. Slices would then be three words long (a pointer to the string struct, a start offset, and an end offset). Iterating would cost O(log(S) + M), where S is the number of encoded segments and M is the number of bytes in the slice.
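The O(log S) part is just a binary search over that table. A sketch, assuming the table stores (start offset, encoding) pairs sorted by offset (all names here are hypothetical):

```python
import bisect

class SegmentTable:
    """Side table mapping byte ranges of a flat string to encodings.
    Segment starts must be sorted ascending, with the first at offset 0."""

    def __init__(self, segments):
        # segments: list of (start_offset, encoding_name) pairs
        self.starts = [start for start, _ in segments]
        self.encodings = [enc for _, enc in segments]

    def encoding_at(self, offset):
        """O(log S): find the segment whose range covers `offset`."""
        i = bisect.bisect_right(self.starts, offset) - 1
        return self.encodings[i]
```

Iterating a slice then finds the first covering segment once, in O(log S), and walks M bytes from there.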

Anyway, you either get a more complex data structure or terrible time complexity, but at least you never suffer both at once.

And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps within a single sentence. How would a header-based encoding work for such documents? Never mind the recent trend of liberally sprinkling emojis all over regular text. If every emoji embedded in a string requires splitting the string into three segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

"Header" implies that all encoding data appears at the start of the document, or in a separate metadata segment. (Call it a start index and two bytes to specify the encoding; reserve the first few bits of the encoding to specify the width.) It also brings to mind HTTP, and reminds me that most documents are either mostly ASCII or a heavy mix of ASCII and something else (HTML and XML being the forerunners).
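That parenthetical layout could be packed like this; the exact bit split (2 bits of code-unit width, 14 bits of encoding id) is my guess at what it describes, not anything specified:

```python
WIDTH_BITS = 2  # top bits of the 16-bit field: code-unit width as a power of two

def pack_encoding(width_log2, encoding_id):
    """Pack a width (log2 of bytes per code unit) and an encoding id
    into the two-byte encoding field."""
    assert 0 <= width_log2 < (1 << WIDTH_BITS)
    assert 0 <= encoding_id < (1 << (16 - WIDTH_BITS))
    return (width_log2 << (16 - WIDTH_BITS)) | encoding_id

def unpack_encoding(field):
    """Recover (width_log2, encoding_id) from the packed field."""
    return field >> (16 - WIDTH_BITS), field & ((1 << (16 - WIDTH_BITS)) - 1)
```

A decoder could then branch on the width bits without knowing the specific encoding, which is presumably the point of reserving them.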

If the encoding succeeded at making most scripts single-byte, then, testing with https://ar.wikipedia.org/wiki/Main_Page, you might get within 15% of UTF-8's efficiency. And then a simple sentence like "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?" is 2.64 times as long in this encoding as in UTF-8, since it has ten encoded segments, each carrying header overhead. (Assuming the header supports strings up to 2^32 bytes long.)

If it didn't succeed at making Latin and Arabic single-byte (and the Latin script contains over 800 characters in Unicode, while Arabic has over three hundred), it would be worse than UTF-16.
