On 5/16/2018 10:01 PM, Joakim wrote:
Unicode standardized all the existing code pages and then added
these new transfer formats, but I have long thought that they'd have been better
off going with a header-based format that kept most languages in a single-byte
scheme, as they mostly were, except obviously for the Asian CJK languages. That
way, you optimize for the common string, i.e. one that contains a single language
or at least no CJK, rather than pessimizing every non-ASCII language by doubling
its character width, as UTF-8 does. This UTF-8 issue is one of the first topics
I raised in this forum, but as you noted at the time, nobody agreed and I don't
want to dredge that all up again.
It sounds like the main issue is that a header-based encoding would take less
space?
If that's correct, then I hypothesize that adding an LZW compression layer would
achieve the same or better result.
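
One rough way to check that hypothesis is to measure it. The following is only a sketch in Python, under several assumptions that are not from the discussion above: Windows-1251 (`cp1251`) stands in for whatever single-byte code page a header-based scheme would select, the function names (`lzw_compress`, `lzw_decompress`, `packed_size`) are made up for illustration, codes are assumed to be packed at one fixed bit width, and the sample is a single repeated Russian phrase, which flatters any dictionary compressor. Real text would need to be measured separately.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Textbook LZW: emit one integer code per longest known phrase."""
    table = {bytes([i]): i for i in range(256)}  # start with all single bytes
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate              # keep extending the phrase
        else:
            codes.append(table[current])     # emit the longest known phrase
            table[candidate] = next_code     # learn the new, longer phrase
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Inverse of lzw_compress; rebuilds the same table while decoding."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Code refers to the phrase being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)


def packed_size(codes: list[int]) -> int:
    """Bytes needed if every code uses one fixed width (assumed, min 9 bits)."""
    if not codes:
        return 0
    width = max(9, max(codes).bit_length())
    return (len(codes) * width + 7) // 8


if __name__ == "__main__":
    sample = "Привет, мир! " * 200           # artificial, highly repetitive
    utf8 = sample.encode("utf-8")
    single_byte = sample.encode("cp1251")    # stand-in for a code-page scheme
    codes = lzw_compress(utf8)
    assert lzw_decompress(codes) == utf8     # round-trip sanity check
    print("UTF-8 bytes:      ", len(utf8))
    print("cp1251 bytes:     ", len(single_byte))
    print("LZW(UTF-8) bytes: ", packed_size(codes))
```

This compares storage size only. It does not model the header itself or how a mixed-language string would switch code pages.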