Re: Of possible interest: fast UTF8 validation

H. S. Teoh via Digitalmars-d Thu, 17 May 2018 16:17:13 -0700

On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d 
wrote:
[...]
> - the auto-synchronization and the statelessness are big deals.


Yes.  Imagine if we standardized on a header-based string encoding, and
we wanted to implement a substring function over a string that contains
multiple segments of different languages. Instead of a cheap slicing
over the string, you'd need to scan the string or otherwise keep track
of which segment the start/end of the substring lies in, allocate memory
to insert headers so that the segments are properly interpreted, etc.
It would be an implementation nightmare, and an unavoidable
performance hit (you'd have to copy data every time you take a
substring), and the @nogc guys would be up in arms.
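
To make the contrast concrete, here is a minimal sketch of what self-synchronization buys you with UTF-8. It is in Python only for brevity; the helper names (is_continuation, align_forward, utf8_slice) are made up for illustration, and snapping offsets forward is just one possible policy, not something prescribed by any library. The point is that a boundary can be found by looking at the bytes near the offset alone, with no header and no state, and the slice itself is a view that copies nothing.

```python
def is_continuation(b: int) -> bool:
    # UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    return (b & 0xC0) == 0x80


def align_forward(data: bytes, i: int) -> int:
    # Skip continuation bytes until i sits on a code point boundary.
    # Only the bytes at and after i are inspected; no earlier context is needed.
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


def utf8_slice(data: bytes, start: int, end: int) -> memoryview:
    # Snap both offsets to code point boundaries (an assumed policy),
    # then return a zero-copy view over the original buffer.
    start = align_forward(data, start)
    end = max(start, align_forward(data, end))
    return memoryview(data)[start:end]


# 'a' is 1 byte, 'é' 2 bytes, '日' 3 bytes, the emoji 4 bytes, 'b' 1 byte.
data = "aé日🙂b".encode("utf-8")

# Offsets 2 and 7 both land in the middle of a multi-byte sequence.
piece = utf8_slice(data, 2, 7)
print(bytes(piece).decode("utf-8"))
```

Both offsets above fall inside a multi-byte sequence, yet the result still decodes cleanly, because each offset only has to skip at most three continuation bytes to resynchronize.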

And that's assuming we have a sane header-based encoding for strings
that contain segments in multiple languages in the first place.
Linguistic analysis articles, for example, would easily contain many
such segments within a paragraph, or perhaps in the same sentence. How
would a header-based encoding work for such documents?  Never mind the
recent trend of liberally sprinkling emojis all over regular text. If
every emoji embedded in a string requires splitting the string into 3
segments complete with their own headers, I dare not imagine what the
code that manipulates such strings would look like.
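
As a rough illustration of the segment explosion, the sketch below models a purely hypothetical header-based scheme in which every run of ASCII or non-ASCII characters needs its own header. The "ascii"/"other" labels and the segments function are invented for this example and do not correspond to any real encoding; a real scheme would need far finer distinctions and hence even more segments.

```python
from itertools import groupby


def segments(text):
    # Hypothetical rule: start a new (header, payload) segment whenever the
    # text switches between ASCII and non-ASCII characters.
    return [
        ("ascii" if is_ascii else "other", "".join(run))
        for is_ascii, run in groupby(text, key=lambda ch: ord(ch) < 128)
    ]


# One emoji in the middle of a sentence forces three segments.
for header, payload in segments("nice 🙂 day"):
    print(header, repr(payload))

# The same text in UTF-8 is a single flat byte sequence with no headers.
print(len("nice 🙂 day".encode("utf-8")), "bytes, no segments")
```

Under this toy rule, a sentence with n embedded emojis separated by ordinary text splits into up to 2n + 1 segments, each carrying its own header, whereas the UTF-8 form stays a single flat array.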


T

-- 
Famous last words: I *think* this will work...
