On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
On 5/16/2018 10:01 PM, Joakim wrote:
Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as most of them were except, obviously, for the Asian CJK languages. That way, you optimize for the common string, i.e. one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.

It sounds like the main issue is that a header-based encoding would take up less space?

If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

Indeed, and there are other compression/deduplication options that would allow limited random access / slicing (by decoding a single “block” to access an element, for instance).
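
To make that concrete, here is a rough sketch of block-wise compression with cheap random access (in Python rather than D, purely for brevity, and using zlib instead of LZW). The names compress_blocks and byte_at and the block size are made up for illustration, and the index is a raw byte offset, so a block boundary can still split a multi-byte sequence:

import zlib

BLOCK = 4096  # illustrative block size; a real implementation would tune this

def compress_blocks(data: bytes) -> list[bytes]:
    # Compress fixed-size blocks independently so any one of them
    # can be decoded without touching the rest of the string.
    return [zlib.compress(data[i:i + BLOCK])
            for i in range(0, len(data), BLOCK)]

def byte_at(blocks: list[bytes], index: int) -> int:
    # Random access: decompress only the block containing `index`.
    return zlib.decompress(blocks[index // BLOCK])[index % BLOCK]

text = ("こんにちは世界 " * 1000).encode("utf-8")
blocks = compress_blocks(text)
assert byte_at(blocks, 12345) == text[12345]

A real string type would also have to keep per-block offsets or code point counts so slices land on character boundaries, which is part of why this only buys limited random access.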

Anything that depends on external information and is not self-synchronizing is awful for interchange. Internally an application can do some smart things, but even then techniques like interning (or partial interning) might be a more valuable approach. Relying on TCP being reliable just plain doesn’t cut it: corruption of a single bit is very real.
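
For what it's worth, the self-sync property is easy to demonstrate: a UTF-8 decoder can skip a damaged sequence and pick up again at the next lead byte, so a single flipped bit costs one code point rather than everything after it. Rough sketch, again in Python; resync_decode is just a made-up helper:

def resync_decode(buf: bytes) -> str:
    # Decode UTF-8, resynchronizing after damage: on error, skip bytes
    # until the next byte that is not a continuation byte (0b10xxxxxx).
    out, i = [], 0
    while i < len(buf):
        b = buf[i]
        if b < 0x80:
            n = 1
        elif b >> 5 == 0b110:
            n = 2
        elif b >> 4 == 0b1110:
            n = 3
        elif b >> 3 == 0b11110:
            n = 4
        else:
            i += 1  # orphan continuation byte: skip it and resync
            continue
        try:
            out.append(buf[i:i + n].decode("utf-8"))
            i += n
        except UnicodeDecodeError:
            i += 1  # damaged sequence: drop it, resync at the next lead byte
    return "".join(out)

damaged = bytearray("naïve résumé".encode("utf-8"))
damaged[2] ^= 0x40  # flip one bit in the lead byte of a multi-byte sequence
print(resync_decode(bytes(damaged)))  # "nave résumé": only one code point lost

A header-based or stateful scheme can't recover locally like this: once the shared state is lost, everything after the damage is garbage, which is exactly the interchange problem above.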
