On 5/16/2018 10:01 PM, Joakim wrote:
Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were, except of course for the Asian CJK languages. That way, you optimize for the common string, i.e. one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.

It sounds like the main issue is that a header-based encoding would take up less space?

If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.
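
One rough way to check that hypothesis is to run LZW over UTF-8 text and compare against the same text in a single-byte code page. The following is only a minimal sketch: the function names are made up for illustration, windows-1251 stands in for whatever a header-based scheme would select for Russian, and the output is a list of integer codes rather than a bit-packed stream, so the counts give an indication, not a real file size.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Encode bytes as a list of LZW codes (not bit-packed)."""
    # Start with one dictionary entry per possible byte value.
    table = {bytes([i]): i for i in range(256)}
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            # Keep extending the match while the dictionary knows it.
            current = candidate
        else:
            # Emit the longest known match and learn the new sequence.
            codes.append(table[current])
            table[candidate] = len(table)
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Invert lzw_compress by rebuilding the same dictionary."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry that is about to be defined.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


# Illustrative sample only: a short, repetitive Russian string.
text = "привет мир " * 20
utf8 = text.encode("utf-8")
single_byte = text.encode("cp1251")  # assumed stand-in for a header-based scheme
codes = lzw_compress(utf8)

print("UTF-8 bytes:      ", len(utf8))
print("windows-1251 bytes:", len(single_byte))
print("LZW codes (UTF-8): ", len(codes))
assert lzw_decompress(codes) == utf8
```

How the numbers compare depends heavily on the input: short strings give LZW little to learn from, and a real comparison would need to account for the width of each emitted code.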
