On Thu, May 17, 2018 at 10:16:03AM -0700, Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
> > Unicode was a standardization of all the existing code pages and
> > then added these new transfer formats, but I have long thought that
> > they'd have been better off going with a header-based format that
> > kept most languages in a single-byte scheme, as they mostly were
> > except for obviously the Asian CJK languages. That way, you optimize
> > for the common string, i.e. one that contains a single language or at
> > least no CJK, rather than pessimizing every non-ASCII language by
> > doubling its character width, as UTF-8 does. This UTF-8 issue is one
> > of the first topics I raised in this forum, but as you noted at the
> > time nobody agreed and I don't want to dredge that all up again.
>
> It sounds like the main issue is that a header-based encoding would
> take less size?
>
> If that's correct, then I hypothesize that adding an LZW compression
> layer would achieve the same or better result.
My bet is on LZW being *far* better than a header-based encoding.
Natural language, which makes up a large part of textual data, tends to
have a lot of built-in redundancy and is therefore highly compressible.
A proper compression algorithm will beat any header-based size-reduction
scheme, while still maintaining the context-free nature of UTF-8.

T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi
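P.S. A quick back-of-the-envelope sketch of the size comparison, in
Python. LZW itself isn't in the Python standard library, so zlib's
DEFLATE (LZ77 + Huffman) stands in for it here; the Greek sample text
and the 4-byte header size for the hypothetical code-page encoding are
my own assumptions, not anything from the thread:

```python
# Compare three sizes for the same non-ASCII text:
#   1. raw UTF-8 (every Greek letter costs 2 bytes),
#   2. a hypothetical header-based single-byte code page
#      (one byte per character plus a small header naming the page),
#   3. general-purpose compression over the UTF-8 bytes.
import zlib

# The *40 repetition exaggerates redundancy, but real prose is also
# highly redundant and compresses well.
text = "Η γρήγορη καφέ αλεπού πηδάει πάνω από το τεμπέλικο σκυλί. " * 40

utf8 = text.encode("utf-8")
header_based = len(text) + 4          # assumed: 4-byte code-page header
compressed = len(zlib.compress(utf8, 9))

print(f"UTF-8:        {len(utf8)} bytes")
print(f"header-based: {header_based} bytes (hypothetical)")
print(f"DEFLATE:      {compressed} bytes")
assert compressed < header_based < len(utf8)
```

On this (admittedly redundant) sample, compression comes in well under
the header-based scheme, which in turn beats raw UTF-8 — and unlike the
header-based scheme, the compressed stream needs no stateful code-page
context to interpret a substring once decompressed.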