On Thursday, 17 May 2018 at 23:11:22 UTC, Ethan wrote:
> On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
>> TCP being reliable just plain doesn’t cut it. Corruption of
>> single bit is very real.
>
> Quoting to highlight and agree.
>
> TCP is reliable because it resends dropped packets and delivers
> them in order.
>
> I don't write TCP packets to my long-term storage medium.
>
> UTF as a transportation protocol Unicode is *far* more useful
> than just sending across a network.
The point wasn't that TCP handles all the errors; it was a
throwaway example of one other layer of the system, the network
transport layer, which actually carries a checksum that will detect
a single bitflip, something UTF-8 usually will not. I mentioned
that the filesystem and several other layers have their own such
error detection, yet you guys crazily latch on to the TCP example
alone.
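
To make this concrete, here's a throwaway sketch in D (the string and
the flipped bit are arbitrary, and byteSum is just a stand-in for a
real checksum like the one in the TCP header): flip one bit in plain
ASCII text and std.utf.validate happily accepts the result, because it
is still perfectly valid UTF-8, while even a naive byte sum over the
payload notices the change.

import std.stdio;
import std.utf : validate, UTFException;

// Stand-in for a transport-style checksum: any function of all the
// bytes will catch a single flipped bit somewhere in the payload.
uint byteSum(const(ubyte)[] data)
{
    uint s = 0;
    foreach (b; data)
        s += b;
    return s;
}

void main()
{
    string original = "hello";
    char[] corrupted = original.dup;
    corrupted[0] ^= 0x01;  // 'h' (0x68) becomes 'i' (0x69)

    try
    {
        validate(corrupted);  // passes: "iello" is valid UTF-8
        writeln("UTF-8 sees nothing wrong with: ", corrupted);
    }
    catch (UTFException e)
        writeln("UTF-8 validation failed: ", e.msg);

    writeln("checksum mismatch: ",
            byteSum(cast(const(ubyte)[]) original) !=
            byteSum(cast(const(ubyte)[]) corrupted));
}
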
On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via
> Digitalmars-d wrote: [...]
>> - the auto-synchronization and the statelessness are big deals.
>
> Yes. Imagine if we standardized on a header-based string
> encoding, and we wanted to implement a substring function over
> a string that contains multiple segments of different
> languages. Instead of a cheap slicing over the string, you'd
> need to scan the string or otherwise keep track of which
> segment the start/end of the substring lies in, allocate memory
> to insert headers so that the segments are properly
> interpreted, etc.. It would be an implementational nightmare,
> and an unavoidable performance hit (you'd have to copy data
> every time you take a substring), and the @nogc guys would be
> up in arms.
As we discussed when I first raised this header scheme years ago,
you're right that slicing could be more expensive, depending on
whether you choose to allocate a new header for the substring or
not. The question is whether the optimizations such a header
enables, by telling you up front where all the language segments
in a multi-language string are, make up for UTF-8 forcing you to
expensively process the entire string to recover that or other
data. I think it's fairly obvious the design tradeoff of the
header would beat out UTF-8 for all but a few degenerate cases,
but maybe you don't see it.
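
To be concrete about what such a header buys and costs, here's a rough
sketch in D of one possible layout (hypothetical names and fields, not
the exact scheme from that old thread): one header entry per language
run, single-byte code units in the payload, and a standalone substring
that has to clip and reallocate its header, while gaining up-front
knowledge of where every language run sits.

struct Segment
{
    ushort lang;    // hypothetical language/code-page id
    size_t offset;  // start of the run in the payload
    size_t length;  // number of code units in the run
}

struct HeaderString
{
    Segment[]      header;   // one entry per language run
    const(ubyte)[] payload;  // single-byte code units

    // Slicing the payload is as cheap as any D array slice, but a
    // self-contained substring needs its header rebuilt: find the
    // runs that [lo, hi) touches and clip them. That's the extra
    // allocation UTF-8 slicing never pays, in exchange for knowing
    // where every language run begins without scanning the text.
    HeaderString substring(size_t lo, size_t hi) const
    {
        Segment[] subHeader;
        foreach (seg; header)
        {
            immutable segLo = seg.offset > lo ? seg.offset : lo;
            immutable segEnd = seg.offset + seg.length;
            immutable segHi = segEnd < hi ? segEnd : hi;
            if (segLo < segHi)
                subHeader ~= Segment(seg.lang, segLo - lo, segHi - segLo);
        }
        return HeaderString(subHeader, payload[lo .. hi]);
    }
}
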
> And that's assuming we have a sane header-based encoding for
> strings that contain segments in multiple languages in the
> first place. Linguistic analysis articles, for example, would
> easily contain many such segments within a paragraph, or
> perhaps in the same sentence. How would a header-based encoding
> work for such documents?
It would bloat the header to some extent, but the result would
still be smaller than the equivalent UTF-8 string. You may want to
use special header encodings for such edge cases too, if you want
to maintain the same large performance lead over UTF-8 that you'd
have in the common case.
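
To put rough numbers on that, assuming, say, 6-8 bytes of header per
language run and one byte per character within a run: a paragraph with
ten short runs in different non-Latin scripts, 200 characters in all,
comes to about 200 bytes of payload plus 60-80 bytes of header, roughly
260-280 bytes, while the same text in UTF-8, at two to three bytes per
non-ASCII character, lands somewhere around 400-600 bytes. The header
only starts losing once the runs shrink toward a character or two each,
which is the degenerate case you raise next.
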
> Nevermind the recent trend of
> liberally sprinkling emojis all over regular text. If every
> emoticon embedded in a string requires splitting the string
> into 3 segments complete with their own headers, I dare not
> imagine what the code that manipulates such strings would look
> like.
Personally, I don't consider emojis worth implementing :) and they
shouldn't be part of Unicode. But since they are, I'm fairly
certain header-based text messages with emojis would be
significantly smaller than using UTF-8/16.
I was surprised to see that adding an emoji to a text message I
sent last year cut my message character quota in half. I googled
why this was and it turns out that when you add an emoji, the
text messaging client actually changes your message encoding from
UTF-8 to UTF-16! I don't know if this is a limitation of the
default Android messaging client, my telco carrier, or SMS, but I
strongly suspect this is widespread.
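
The halving at least matches the payload arithmetic, assuming the
standard 140-byte SMS payload: at one byte per character you get 140
characters per message, at two bytes per UTF-16 code unit you get 70,
and with the 7-bit GSM alphabet many SMS stacks default to, 160.
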
Anyway, I can see the arguments about UTF-8 this time around are
as bad as the first time I raised it five years back, so I'll
leave this thread here.