On Thursday, 17 May 2018 at 23:11:22 UTC, Ethan wrote:
On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
TCP being reliable just plain doesn’t cut it. Corruption of a single bit is very real.

Quoting to highlight and agree.

TCP is reliable because it resends dropped packets and delivers them in order.

I don't write TCP packets to my long-term storage medium.

UTF as a transportation protocol for Unicode is *far* more useful than just sending across a network.

The point wasn't that TCP is handling all the errors, it was a throwaway example of one other layer of the system, the network transport layer, that actually has a checksum that will detect a single bitflip, which UTF-8 will not usually detect. I mentioned that the filesystem and several other layers have their own such error detection, yet you guys crazily latch on to the TCP example alone.
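To make the UTF-8 half of that concrete, here's a small D illustration I put together (not anyone's production code): flip one low bit in an ASCII byte and the text still validates as UTF-8, so the encoding itself never notices.

import std.stdio;
import std.utf : validate, UTFException;

void main()
{
    // Flip one low bit in an ASCII byte: 'h' (0x68) becomes 'j' (0x6a),
    // which is still perfectly valid UTF-8, so validation can't flag it.
    char[] msg = "hello".dup;
    msg[0] ^= 0b0000_0010;

    try
    {
        validate(msg);   // std.utf.validate throws on malformed UTF-8
        writeln("still valid UTF-8: ", msg);
    }
    catch (UTFException e)
        writeln("corruption detected: ", e.msg);
}

Flip a high bit instead and you may land on a malformed sequence, which is why I said "usually" rather than "never".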

On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via Digitalmars-d wrote: [...]
- the auto-synchronization and the statelessness are big deals.

Yes. Imagine if we standardized on a header-based string encoding, and we wanted to implement a substring function over a string that contains multiple segments of different languages. Instead of a cheap slicing over the string, you'd need to scan the string or otherwise keep track of which segment the start/end of the substring lies in, allocate memory to insert headers so that the segments are properly interpreted, etc. It would be an implementational nightmare, and an unavoidable performance hit (you'd have to copy data every time you take a substring), and the @nogc guys would be up in arms.

As we discussed when I first raised this header scheme years ago, you're right that slicing could be more expensive, depending on whether you choose to allocate a new header for the substring or not. The question is whether the optimizations such a header enables, by telling you up front where all the language segments in a multi-language string are, make up for that cost, versus having to expensively scan the entire UTF-8 string to extract that or other data. I think it's fairly obvious the design tradeoff of the header would beat out UTF-8 for all but a few degenerate cases, but maybe you don't see it.
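To make that tradeoff concrete, here's a rough D sketch of the kind of layout I mean. Every name and field in it is made up for illustration, since nothing in this thread pins the scheme down:

// All names here are hypothetical; this only sketches the slicing trade-off
// under the assumption of a per-run header plus single-byte payload.
struct Segment
{
    ushort lang;    // which single-byte code table this run's bytes use
    size_t offset;  // where the run starts in the payload
    size_t length;  // how many bytes it covers
}

struct HeaderString
{
    Segment[] header;  // one entry per language run
    ubyte[] payload;   // one byte per character

    // A substring has to rebuild a header covering only the requested range,
    // which generally means an allocation, unlike slicing a UTF-8 char[].
    HeaderString slice(size_t start, size_t end) const
    {
        Segment[] subHeader;
        foreach (seg; header)
        {
            const lo = seg.offset > start ? seg.offset : start;
            const hi = seg.offset + seg.length < end ? seg.offset + seg.length : end;
            if (lo < hi)
                subHeader ~= Segment(seg.lang, lo - start, hi - lo);
        }
        return HeaderString(subHeader, payload[start .. end].dup);
    }
}

void main()
{
    // Two runs: bytes 0-1 in language 1, bytes 2-3 in language 2.
    auto s = HeaderString([Segment(1, 0, 2), Segment(2, 2, 2)],
                          cast(ubyte[]) "abcd".dup);
    auto sub = s.slice(1, 3);       // straddles both runs
    assert(sub.header.length == 2);
}

Slicing a plain UTF-8 char[] is just a pointer and a length, whereas this walks the header and allocates; in exchange, you know which single-byte table every run uses without decoding a single character.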

And that's assuming we have a sane header-based encoding for strings that contain segments in multiple languages in the first place. Linguistic analysis articles, for example, would easily contain many such segments within a paragraph, or perhaps in the same sentence. How would a header-based encoding work for such documents?

It would bloat the header to some extent, but the result would still be smaller than the equivalent UTF-8 string. You may want to use special header encodings for such edge cases too, if you want to maintain the same large performance lead over UTF-8 that you'd have in the common case.
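To put some purely illustrative numbers on that (assuming something like 4 header bytes per language run, which is a guess, not a spec): a paragraph with five single-language runs carries about 20 bytes of header, while UTF-8 pays at least one extra byte for every non-ASCII character in those runs, so a couple dozen Cyrillic or Greek words, i.e. 100+ characters at 2 bytes each instead of 1, already cost 100+ extra bytes and swamp the header.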

Nevermind the recent trend of liberally sprinkling emojis all over regular text. If every emoticon embedded in a string requires splitting the string into 3 segments complete with their own headers, I dare not imagine what the code that manipulates such strings would look like.

Personally, I don't consider emojis worth implementing :) as they shouldn't be part of Unicode in the first place. But since they are, I'm fairly certain header-based text messages with emojis would still be significantly smaller than with UTF-8/16.

I was surprised to see that adding an emoji to a text message I sent last year cut my message character quota in half. I googled why and it turns out that when you add an emoji, the messaging client switches your message encoding from the 7-bit GSM alphabet to UTF-16 (UCS-2). I don't know if this is a limitation of the default Android messaging client, my telco carrier, or SMS itself, but I strongly suspect it's widespread.
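For some rough numbers on the encodings themselves (this is just counting code units in D and ignores how SMS actually packs characters):

import std.stdio;

void main()
{
    // Counting code units only; this says nothing about SMS packing rules,
    // it just shows the relative cost of each UTF encoding.
    string  s8  = "Call me later 😀";    // UTF-8:  length counts bytes
    wstring s16 = "Call me later 😀"w;   // UTF-16: length counts 2-byte units
    dstring s32 = "Call me later 😀"d;   // UTF-32: length counts 4-byte units

    writeln("UTF-8  code units: ", s8.length);   // 18
    writeln("UTF-16 code units: ", s16.length);  // 16
    writeln("UTF-32 code units: ", s32.length);  // 15
}

The emoji alone costs four bytes in UTF-8 and two 16-bit units in UTF-16, on top of UTF-16 doubling every ASCII character in the message.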

Anyway, I can see the arguments about UTF-8 this time around are as bad as the first time I raised it five years back, so I'll leave this thread here.
