On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
On 5/16/2018 10:01 PM, Joakim wrote:
Unicode was a standardization of all the existing code pages
and then added these new transfer formats, but I have long
thought that they'd have been better off going with a
header-based format that kept most languages in a single-byte
scheme, as they mostly were except for obviously the Asian CJK
languages. That way, you optimize for the common string, ie
one that contains a single language or at least no CJK, rather
than pessimizing every non-ASCII language by doubling its
character width, as UTF-8 does. This UTF-8 issue is one of the
first topics I raised in this forum, but as you noted at the
time nobody agreed and I don't want to dredge that all up
again.
It sounds like the main issue is that a header-based encoding
would take less space?
If that's correct, then I hypothesize that adding an LZW
compression layer would achieve the same or better result.
Indeed, and there are other compression/deduplication options that
would allow limited random access / slicing (for instance, decoding
a single “block” to reach an element).
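
As a minimal sketch of what such block-wise compression could look
like (my own illustration, not anything proposed in the thread;
the 4 KiB block size and the CompressedText name are just placeholders),
each fixed-size block is compressed independently with std.zlib, so
reading one byte only needs the block that contains it:

import std.algorithm.comparison : min;
import std.zlib : compress, uncompress;

enum blockSize = 4096;

struct CompressedText
{
    ubyte[][] blocks;   // one independent zlib stream per block
    size_t length;      // total uncompressed length

    static CompressedText encode(string s)
    {
        CompressedText t;
        t.length = s.length;
        for (size_t i = 0; i < s.length; i += blockSize)
            t.blocks ~= cast(ubyte[]) compress(s[i .. min(i + blockSize, s.length)]);
        return t;
    }

    // Random access: decompress only the block holding index i.
    ubyte opIndex(size_t i)
    {
        auto block = cast(ubyte[]) uncompress(blocks[i / blockSize], blockSize);
        return block[i % blockSize];
    }
}

You trade one block's decompression per access for the size
reduction, and slicing within a block falls out for free.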
Anything that depends on external information and is not
self-synchronizing is awful for interchange. Internally an
application can do something smarter, but even there things like
interning (or partial interning) might be a more valuable approach.
Relying on TCP being reliable just plain doesn’t cut it; corruption
of a single bit is very real.
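
To make the self-synchronization point concrete (my own sketch, not
from the thread): in UTF-8 every continuation byte has the form
10xxxxxx, so after a corrupted byte a decoder can skip forward to the
next lead or ASCII byte and lose at most one code point, whereas a
header-described single-byte encoding has no such in-band marker and
a single flipped bit can poison everything that follows.

// Skip forward to the next byte that can start a UTF-8 code point,
// i.e. anything that is not a continuation byte (10xxxxxx).
size_t resync(const(ubyte)[] data, size_t pos)
{
    while (pos < data.length && (data[pos] & 0xC0) == 0x80)
        ++pos;
    return pos;
}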