On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
On 5/16/2018 10:01 PM, Joakim wrote:
Unicode was a standardization of all the existing code pages
and then added these new transfer formats, but I have long
thought that they'd have been better off going with a
header-based format that kept most languages in a single-byte
scheme, as they mostly were except for obviously the Asian CJK
languages. That way, you optimize for the common string, ie
one that contains a single language or at least no CJK, rather
than pessimizing every non-ASCII language by doubling its
character width, as UTF-8 does. This UTF-8 issue is one of the
first topics I raised in this forum, but as you noted at the
time nobody agreed and I don't want to dredge that all up
again.
It sounds like the main issue is that a header-based encoding
would take less space?
If that's correct, then I hypothesize that adding an LZW
compression layer would achieve the same or better result.
Indeed, and there are other compression/deduplication options that
would allow limited random access / slicing (for instance, decoding
a single “block” to reach an element).
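
As a minimal sketch of what such block-wise compression could look
like (my own illustration, not anything proposed in the thread;
the 4 KiB block size and the CompressedText name are just placeholders),
each fixed-size block is compressed independently with std.zlib, so
reading one byte only needs the block that contains it:

import std.algorithm.comparison : min;
import std.zlib : compress, uncompress;

enum blockSize = 4096;

struct CompressedText
{
    ubyte[][] blocks;   // one independent zlib stream per block
    size_t length;      // total uncompressed length

    static CompressedText encode(string s)
    {
        CompressedText t;
        t.length = s.length;
        for (size_t i = 0; i < s.length; i += blockSize)
            t.blocks ~= cast(ubyte[]) compress(s[i .. min(i + blockSize, s.length)]);
        return t;
    }

    // Random access: decompress only the block holding index i.
    ubyte opIndex(size_t i)
    {
        auto block = cast(ubyte[]) uncompress(blocks[i / blockSize], blockSize);
        return block[i % blockSize];
    }
}

You trade one block's decompression per access for the size
reduction, and slicing within a block falls out for free.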
Anything that depends on external information and is not
self-synchronizing is awful for interchange. Internally an
application can do something smarter, but even there things like
interning (or partial interning) might be a more valuable approach.
Relying on TCP being reliable just plain doesn’t cut it; corruption
of a single bit is very real.
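
To make the self-synchronization point concrete (my own sketch, not
from the thread): in UTF-8 every continuation byte has the form
10xxxxxx, so after a corrupted byte a decoder can skip forward to the
next lead or ASCII byte and lose at most one code point, whereas a
header-described single-byte encoding has no such in-band marker and
a single flipped bit can poison everything that follows.

// Skip forward to the next byte that can start a UTF-8 code point,
// i.e. anything that is not a continuation byte (10xxxxxx).
size_t resync(const(ubyte)[] data, size_t pos)
{
    while (pos < data.length && (data[pos] & 0xC0) == 0x80)
        ++pos;
    return pos;
}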