On Thu, May 17, 2018 at 10:16:03AM -0700, Walter Bright via Digitalmars-d wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
> > Unicode was a standardization of all the existing code pages and
> > then added these new transfer formats, but I have long thought that
> > they'd have been better off going with a header-based format that
> > kept most languages in a single-byte scheme, as they mostly were
> > except for obviously the Asian CJK languages. That way, you optimize
> > for the common string, i.e. one that contains a single language or at
> > least no CJK, rather than pessimizing every non-ASCII language by
> > doubling its character width, as UTF-8 does. This UTF-8 issue is one
> > of the first topics I raised in this forum, but as you noted at the
> > time nobody agreed and I don't want to dredge that all up again.
>
> It sounds like the main issue is that a header-based encoding would
> take less size?
>
> If that's correct, then I hypothesize that adding an LZW compression
> layer would achieve the same or better result.
My bet is on LZW being *far* better than a header-based encoding.
Natural language, which makes up a large part of textual data, tends to
have a lot of built-in redundancy and is therefore highly compressible.
A proper compression algorithm will beat any header-based size-reduction
scheme, while still maintaining the context-free nature of UTF-8.

T

-- 
In a world without fences, who needs Windows and Gates? -- Christian Surchi
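P.S. A quick back-of-the-envelope sketch of the size comparison, in
Python. LZW itself isn't in the Python standard library, so zlib's
DEFLATE (LZ77 + Huffman) stands in for it here; the Greek sample text
and the 4-byte header size for the hypothetical code-page encoding are
my own assumptions, not anything from the thread:

```python
# Compare three sizes for the same non-ASCII text:
#   1. raw UTF-8 (every Greek letter costs 2 bytes),
#   2. a hypothetical header-based single-byte code page
#      (one byte per character plus a small header naming the page),
#   3. general-purpose compression over the UTF-8 bytes.
import zlib

# The *40 repetition exaggerates redundancy, but real prose is also
# highly redundant and compresses well.
text = "Η γρήγορη καφέ αλεπού πηδάει πάνω από το τεμπέλικο σκυλί. " * 40

utf8 = text.encode("utf-8")
header_based = len(text) + 4          # assumed: 4-byte code-page header
compressed = len(zlib.compress(utf8, 9))

print(f"UTF-8:        {len(utf8)} bytes")
print(f"header-based: {header_based} bytes (hypothetical)")
print(f"DEFLATE:      {compressed} bytes")
assert compressed < header_based < len(utf8)
```

On this (admittedly redundant) sample, compression comes in well under
the header-based scheme, which in turn beats raw UTF-8 — and unlike the
header-based scheme, the compressed stream needs no stateful code-page
context to interpret a substring once decompressed.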