On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
- the auto-synchronization and the statelessness are big deals.

Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments in different languages. Instead of taking a cheap slice of the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc. It would be an implementation nightmare and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the @nogc guys would be up in arms.

You'd have three data structures: Strand, Rope, and Slice.

A Strand is a series of bytes with an encoding. A Rope is a series of Strands. A Slice is a pair of location references within a Rope. You probably want a dedicated data structure to name a location within a Rope: a strand index, then a byte offset within that strand. That's five words instead of two to pass a Slice, but zero dynamic allocations.
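A minimal sketch of those three structures (the names follow the post; the field choices and the `materialize` helper are my own illustration, not a proposal):

```python
from dataclasses import dataclass

@dataclass
class Strand:
    """A run of bytes in a single encoding."""
    encoding: str   # e.g. "ascii", "latin-1"
    data: bytes

@dataclass
class RopePos:
    """A location inside a Rope: strand index, then byte offset."""
    strand: int
    offset: int

@dataclass
class Rope:
    """A series of Strands."""
    strands: list

@dataclass
class Slice:
    """Two positions into a Rope: five words total, zero allocations."""
    rope: Rope
    start: RopePos
    end: RopePos

    def materialize(self) -> str:
        """Decode the covered range (copying only happens here,
        when you leave slice-land, never when taking the slice)."""
        out = []
        for i in range(self.start.strand, self.end.strand + 1):
            s = self.rope.strands[i]
            lo = self.start.offset if i == self.start.strand else 0
            hi = self.end.offset if i == self.end.strand else len(s.data)
            out.append(s.data[lo:hi].decode(s.encoding))
        return "".join(out)
```

Taking a sub-slice of a Slice is then just arithmetic on two RopePos values, which is the whole point.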

This hurts data locality. However, rope-style data structures are handy for some kinds of string manipulation.

As an alternative, you might keep a separate side table specifying which encodings apply to which byte ranges. Slices would then be three words long (a pointer to the string struct, a start offset, and an end offset). Iterating would cost O(log(S) + M), where S is the number of encoded segments and M is the number of bytes in the slice.
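The O(log S) part is just a binary search over that table. A sketch, assuming the table stores (start offset, encoding) pairs sorted by offset (all names here are hypothetical):

```python
import bisect

class SegmentTable:
    """Side table mapping byte ranges of a flat string to encodings.
    Segment starts must be sorted ascending, with the first at offset 0."""

    def __init__(self, segments):
        # segments: list of (start_offset, encoding_name) pairs
        self.starts = [start for start, _ in segments]
        self.encodings = [enc for _, enc in segments]

    def encoding_at(self, offset):
        """O(log S): find the segment whose range covers `offset`."""
        i = bisect.bisect_right(self.starts, offset) - 1
        return self.encodings[i]
```

Iterating a slice then finds the first covering segment once, in O(log S), and walks M bytes from there.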

Anyway, you either get a more complex data structure or terrible time complexity, but at least you never suffer both at once.

And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps within a single sentence. How would a header-based encoding work for such documents? Never mind the recent trend of liberally sprinkling emojis all over regular text. If every emoji embedded in a string requires splitting the string into three segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

"Header" implies that all encoding data appears at the start of the document, or in a separate metadata segment. (Call it a start index and two bytes to specify the encoding; reserve the first few bits of the encoding to specify the width.) It also brings to mind HTTP, and reminds me that most documents are either mostly ASCII or a heavy mix of ASCII and something else (HTML and XML being the forerunners).
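That parenthetical layout could be packed like this; the exact bit split (2 bits of code-unit width, 14 bits of encoding id) is my guess at what it describes, not anything specified:

```python
WIDTH_BITS = 2  # top bits of the 16-bit field: code-unit width as a power of two

def pack_encoding(width_log2, encoding_id):
    """Pack a width (log2 of bytes per code unit) and an encoding id
    into the two-byte encoding field."""
    assert 0 <= width_log2 < (1 << WIDTH_BITS)
    assert 0 <= encoding_id < (1 << (16 - WIDTH_BITS))
    return (width_log2 << (16 - WIDTH_BITS)) | encoding_id

def unpack_encoding(field):
    """Recover (width_log2, encoding_id) from the packed field."""
    return field >> (16 - WIDTH_BITS), field & ((1 << (16 - WIDTH_BITS)) - 1)
```

A decoder could then branch on the width bits without knowing the specific encoding, which is presumably the point of reserving them.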

If the encoding succeeded at making most scripts single-byte, then, testing with https://ar.wikipedia.org/wiki/Main_Page, you might get within 15% of UTF-8's efficiency. And then a simple sentence like "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?" is 2.64 times as long in this encoding as in UTF-8, since it has ten encoded segments, each carrying header overhead. (Assuming the header supports strings up to 2^32 bytes long.)

If it didn't succeed at making Latin and Arabic single-byte (and the Latin script contains over 800 characters in Unicode, while Arabic has over three hundred), it would be worse than UTF-16.
