On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were, except obviously for the Asian CJK languages. That way, you optimize for the common string, i.e. one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.
>
> It sounds like the main issue is that a header-based encoding would take less size?

Yes, and be easier to process.

> If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

In general, you would be wrong; a carefully designed binary format will usually beat the pants off general-purpose compression:

https://www.w3.org/TR/2009/WD-exi-evaluation-20090407/#compactness-results

Of course, that's because you can tailor your binary format for specific types of data, text in this case, and take advantage of patterns in that subset, just as specialized image compression formats do. That said, I haven't compared this scheme to general-purpose compression of UTF-8 strings, so I don't know which would compress text better.

However, that would mostly matter for network transmission; another big gain of a header-based scheme that doesn't use compression is much faster string processing in memory. Yes, the average end user doesn't care about this, but giant consumers of text data, like search engines, would benefit greatly from it.
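
To make that concrete, here is a toy sketch of the kind of thing I mean (my own illustration, with made-up table names and values, not a worked-out proposal): a one-byte header selects a hypothetical single-byte table, and everything after it is one byte per character, which is where both the size and the in-memory processing win come from:

```d
import std.stdio : writeln;

// Hypothetical single-byte tables; names and values are purely illustrative.
enum Lang : ubyte { ascii = 0, cyrillic = 1, greek = 2 }

struct HeaderString
{
    Lang lang;        // header: which single-byte table the payload uses
    ubyte[] payload;  // one byte per character, no multi-byte sequences

    size_t length() const { return payload.length; }     // O(1), unlike UTF-8
    ubyte opIndex(size_t i) const { return payload[i]; } // O(1) random access
}

void main()
{
    // "мир" as three hypothetical table indices: 3 payload bytes plus a
    // 1-byte header, versus 6 bytes in UTF-8 (2 bytes per Cyrillic char).
    ubyte[] payload = [0x01, 0x02, 0x03];
    auto s = HeaderString(Lang.cyrillic, payload);
    writeln(s.length, " chars in ", 1 + s.payload.length, " bytes");
}
```

A real design would obviously need more than a handful of tables and a way to mix scripts in one string, but the common case stays one byte per character, with O(1) length and indexing.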

On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
> Indeed, and some other compression/deduplication options that would allow limited random access / slicing (by decoding a single “block” to access an element for instance).

Possibly competitive on compression for transmission over the network, but unlikely to win for in-memory processing, as noted above for Walter's idea.

> Anything that depends on external information and is not self-sync is awful for interchange.

You are describing the vast majority of all formats and protocols; it's amazing how we got by with them all this time.

> Internally the application can do some smarts though, but even then things like interning (partial interning) might be a more valuable approach. TCP being reliable just plain doesn’t cut it. Corruption of a single bit is very real.

You seem to have missed my point entirely: UTF-8 will not catch most bit flips either; it only detects corruption when certain key bits are flipped in certain ways, a minority of the possibilities. Nobody is arguing that data corruption doesn't happen or that error correction shouldn't be done somewhere.
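
If you want to see how little UTF-8 actually catches, a quick experiment (my own throwaway code, nothing rigorous) is to flip every bit of a small UTF-8 sample one at a time and count how many of the corrupted buffers std.utf.validate still accepts:

```d
import std.stdio : writefln;
import std.utf : validate, UTFException;

void main()
{
    string sample = "Héllo, мир";             // mixes 1- and 2-byte sequences
    auto bytes = cast(ubyte[]) sample.dup;

    size_t caught, missed;
    foreach (i; 0 .. bytes.length)
    {
        foreach (bit; 0 .. 8)
        {
            auto copy = bytes.dup;
            copy[i] ^= cast(ubyte)(1 << bit); // single-bit corruption
            try
            {
                validate(cast(const(char)[]) copy);
                ++missed;                     // still passes as valid UTF-8
            }
            catch (UTFException)
            {
                ++caught;                     // the only flips UTF-8 notices
            }
        }
    }
    writefln("UTF-8 validation caught %s of %s single-bit flips",
             caught, caught + missed);
}
```

On a sample like this most of the flips sail straight through, because any flip that keeps a byte in the ASCII range, or that only touches the payload bits of a continuation byte, yields a different but perfectly valid string.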

The question is whether the extremely limited robustness that UTF-8's significant redundancy buys is a good tradeoff. I think it's obvious that it isn't, and I posit that anybody who knows anything about error-correcting codes would agree with that assessment. You would be much better off with a more compact header-based transfer format, layering on whatever level of error correction you need at a different level, which, as I noted, is already done at the link and transport layers and in various other parts of the system.
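
As a rough sketch of what I mean by layering it on where you need it (my own toy framing, nothing standardized): take the compact payload and add an explicit checksum, which catches every single-bit flip instead of the occasional one UTF-8 happens to notice:

```d
import std.digest.crc : crc32Of;
import std.stdio : writeln;

// Frame a compact payload with an explicit 4-byte CRC32 trailer.
ubyte[] withChecksum(const(ubyte)[] payload)
{
    ubyte[4] crc = crc32Of(payload);
    return payload.dup ~ crc[];
}

// Check the trailer; any single-bit flip anywhere in the frame is caught.
bool verify(const(ubyte)[] framed)
{
    if (framed.length < 4) return false;
    ubyte[4] expected = crc32Of(framed[0 .. $ - 4]);
    return expected[] == framed[$ - 4 .. $];
}

void main()
{
    auto framed = withChecksum(cast(const(ubyte)[]) "compact header + payload");
    writeln(verify(framed));   // true
    framed[3] ^= 0x04;         // corrupt one bit
    writeln(verify(framed));   // false: detected, unlike most flips under UTF-8
}
```

That is the kind of check that already sits at the link and transport layers; there is no need to smear a much weaker version of it across the text encoding itself.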

If you need more error correction than that, do it right, not in the broken way UTF-8 does. Honestly, error detection/correction is the most laughably broken part of UTF-8; it is amazing that people even bring it up as a benefit.
