On Thursday, 17 May 2018 at 23:11:22 UTC, Ethan wrote:
On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
TCP being reliable just plain doesn’t cut it. Corruption of a single bit is very real.

Quoting to highlight and agree.

TCP is reliable because it resends dropped packets and delivers them in order.

I don't write TCP packets to my long-term storage medium.

UTF as a transportation protocol for Unicode is *far* more useful than just sending across a network.

The point wasn't that TCP is handling all the errors, it was a throwaway example of one other layer of the system, the network transport layer, that actually has a checksum that will detect a single bitflip, which UTF-8 will not usually detect. I mentioned that the filesystem and several other layers have their own such error detection, yet you guys crazily latch on to the TCP example alone.
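To make the UTF-8 half of that concrete, here's a small D illustration I put together (not anyone's production code): flip one low bit in an ASCII byte and the text still validates as UTF-8, so the encoding itself never notices.

import std.stdio;
import std.utf : validate, UTFException;

void main()
{
    // Flip one low bit in an ASCII byte: 'h' (0x68) becomes 'j' (0x6a),
    // which is still perfectly valid UTF-8, so validation can't flag it.
    char[] msg = "hello".dup;
    msg[0] ^= 0b0000_0010;

    try
    {
        validate(msg);   // std.utf.validate throws on malformed UTF-8
        writeln("still valid UTF-8: ", msg);
    }
    catch (UTFException e)
        writeln("corruption detected: ", e.msg);
}

Flip a high bit instead and you may land on a malformed sequence, which is why I said "usually" rather than "never".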

On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
- the auto-synchronization and the statelessness are big deals.

Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the @nogc guys would be up in arms.

As we discussed when I first raised this header scheme years ago, you're right that slicing could be more expensive, depending on whether you choose to allocate a new header for the substring or not. The question is whether the optimizations such a header enables, by telling you up front where all the language segments in a multi-language string are, make up for that cost, versus having to expensively scan the entire UTF-8 string to extract that or other data. I think it's fairly obvious the design tradeoff of the header would beat out UTF-8 for all but a few degenerate cases, but maybe you don't see it.
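To make that tradeoff concrete, here's a rough D sketch of the kind of layout I mean. Every name and field in it is made up for illustration, since nothing in this thread pins the scheme down:

// All names here are hypothetical; this only sketches the slicing trade-off
// under the assumption of a per-run header plus single-byte payload.
struct Segment
{
    ushort lang;    // which single-byte code table this run's bytes use
    size_t offset;  // where the run starts in the payload
    size_t length;  // how many bytes it covers
}

struct HeaderString
{
    Segment[] header;  // one entry per language run
    ubyte[] payload;   // one byte per character

    // A substring has to rebuild a header covering only the requested range,
    // which generally means an allocation, unlike slicing a UTF-8 char[].
    HeaderString slice(size_t start, size_t end) const
    {
        Segment[] subHeader;
        foreach (seg; header)
        {
            const lo = seg.offset > start ? seg.offset : start;
            const hi = seg.offset + seg.length < end ? seg.offset + seg.length : end;
            if (lo < hi)
                subHeader ~= Segment(seg.lang, lo - start, hi - lo);
        }
        return HeaderString(subHeader, payload[start .. end].dup);
    }
}

void main()
{
    // Two runs: bytes 0-1 in language 1, bytes 2-3 in language 2.
    auto s = HeaderString([Segment(1, 0, 2), Segment(2, 2, 2)],
                          cast(ubyte[]) "abcd".dup);
    auto sub = s.slice(1, 3);       // straddles both runs
    assert(sub.header.length == 2);
}

Slicing a plain UTF-8 char[] is just a pointer and a length, whereas this walks the header and allocates; in exchange, you know which single-byte table every run uses without decoding a single character.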

And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents?

It would bloat the header to some extent, but the result would still be smaller than the equivalent UTF-8 string. You may want to use special header encodings for such edge cases too, if you want to maintain the same large performance lead over UTF-8 that you'd have in the common case.
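To put some purely illustrative numbers on that (assuming something like 4 header bytes per language run, which is a guess, not a spec): a paragraph with five single-language runs carries about 20 bytes of header, while UTF-8 pays at least one extra byte for every non-ASCII character in those runs, so a couple dozen Cyrillic or Greek words, i.e. 100+ characters at 2 bytes each instead of 1, already cost 100+ extra bytes and swamp the header.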

Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into 3 segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

Personally, I don't consider emojis worth implementing :) as they shouldn't be part of Unicode in the first place. But since they are, I'm fairly certain header-based text messages with emojis would still be significantly smaller than with UTF-8/16.

I was surprised to see that adding an emoji to a text message I sent last year cut my message character quota in half. I googled why and it turns out that when you add an emoji, the messaging client switches your message encoding from the 7-bit GSM alphabet to UTF-16 (UCS-2). I don't know if this is a limitation of the default Android messaging client, my telco carrier, or SMS itself, but I strongly suspect it's widespread.
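For some rough numbers on the encodings themselves (this is just counting code units in D and ignores how SMS actually packs characters):

import std.stdio;

void main()
{
    // Counting code units only; this says nothing about SMS packing rules,
    // it just shows the relative cost of each UTF encoding.
    string  s8  = "Call me later 😀";    // UTF-8:  length counts bytes
    wstring s16 = "Call me later 😀"w;   // UTF-16: length counts 2-byte units
    dstring s32 = "Call me later 😀"d;   // UTF-32: length counts 4-byte units

    writeln("UTF-8  code units: ", s8.length);   // 18
    writeln("UTF-16 code units: ", s16.length);  // 16
    writeln("UTF-32 code units: ", s32.length);  // 15
}

The emoji alone costs four bytes in UTF-8 and two 16-bit units in UTF-16, on top of UTF-16 doubling every ASCII character in the message.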

Anyway, I can see the arguments about UTF-8 this time around are as bad as the first time I raised it five years back, so I'll leave this thread here.
