On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages and then added these new transfer formats, but I have long thought that they'd have been better off going with a header-based format that kept most languages in a single-byte scheme, as they mostly were, except obviously for the Asian CJK languages. That way, you optimize for the common string, i.e. one that contains a single language or at least no CJK, rather than pessimizing every non-ASCII language by doubling its character width, as UTF-8 does. This UTF-8 issue is one of the first topics I raised in this forum, but as you noted at the time nobody agreed and I don't want to dredge that all up again.
>
> It sounds like the main issue is that a header-based encoding would take less size?

Yes, and be easier to process.

> If that's correct, then I hypothesize that adding an LZW compression layer would achieve the same or better result.

In general, you would be wrong; a carefully designed binary format will usually beat the pants off general-purpose compression:

https://www.w3.org/TR/2009/WD-exi-evaluation-20090407/#compactness-results

Of course, that's because you can tailor your binary format for specific types of data, text in this case, and take advantage of patterns in that subset, just as specialized image compression formats do. That said, I haven't compared this scheme to general-purpose compression of UTF-8 strings, so I don't know which would compress text better.

However, that would mostly matter for network transmission; another big gain of a header-based scheme that doesn't use compression is much faster string processing in memory. Yes, the average end user doesn't care about this, but giant consumers of text data, like search engines, would benefit greatly from it.
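
To make that concrete, here is a toy sketch of the kind of thing I mean (my own illustration, with made-up table names and values, not a worked-out proposal): a one-byte header selects a hypothetical single-byte table, and everything after it is one byte per character, which is where both the size and the in-memory processing win come from:

```d
import std.stdio : writeln;

// Hypothetical single-byte tables; names and values are purely illustrative.
enum Lang : ubyte { ascii = 0, cyrillic = 1, greek = 2 }

struct HeaderString
{
    Lang lang;        // header: which single-byte table the payload uses
    ubyte[] payload;  // one byte per character, no multi-byte sequences

    size_t length() const { return payload.length; }     // O(1), unlike UTF-8
    ubyte opIndex(size_t i) const { return payload[i]; } // O(1) random access
}

void main()
{
    // "мир" as three hypothetical table indices: 3 payload bytes plus a
    // 1-byte header, versus 6 bytes in UTF-8 (2 bytes per Cyrillic char).
    ubyte[] payload = [0x01, 0x02, 0x03];
    auto s = HeaderString(Lang.cyrillic, payload);
    writeln(s.length, " chars in ", 1 + s.payload.length, " bytes");
}
```

A real design would obviously need more than a handful of tables and a way to mix scripts in one string, but the common case stays one byte per character, with O(1) length and indexing.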

On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
> Indeed, and some other compression/deduplication options that would allow limited random access / slicing (by decoding a single “block” to access an element for instance).

Possibly competitive on compression for transmission over the network, but unlikely to win for in-memory processing, as noted above for Walter's idea.

> Anything that depends on external information and is not self-sync is awful for interchange.

You are describing the vast majority of all formats and protocols; it's amazing how we got by with them all this time.

> Internally the application can do some smarts though, but even then things like interning (partial interning) might be a more valuable approach. TCP being reliable just plain doesn’t cut it. Corruption of a single bit is very real.

You seem to have missed my point entirely: UTF-8 will not catch most bit flips either; it only detects corruption when certain key bits are flipped in certain ways, a minority of the possibilities. Nobody is arguing that data corruption doesn't happen or that error correction shouldn't be done somewhere.
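
If you want to see how little UTF-8 actually catches, a quick experiment (my own throwaway code, nothing rigorous) is to flip every bit of a small UTF-8 sample one at a time and count how many of the corrupted buffers std.utf.validate still accepts:

```d
import std.stdio : writefln;
import std.utf : validate, UTFException;

void main()
{
    string sample = "Héllo, мир";             // mixes 1- and 2-byte sequences
    auto bytes = cast(ubyte[]) sample.dup;

    size_t caught, missed;
    foreach (i; 0 .. bytes.length)
    {
        foreach (bit; 0 .. 8)
        {
            auto copy = bytes.dup;
            copy[i] ^= cast(ubyte)(1 << bit); // single-bit corruption
            try
            {
                validate(cast(const(char)[]) copy);
                ++missed;                     // still passes as valid UTF-8
            }
            catch (UTFException)
            {
                ++caught;                     // the only flips UTF-8 notices
            }
        }
    }
    writefln("UTF-8 validation caught %s of %s single-bit flips",
             caught, caught + missed);
}
```

On a sample like this most of the flips sail straight through, because any flip that keeps a byte in the ASCII range, or that only touches the payload bits of a continuation byte, yields a different but perfectly valid string.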

The question is whether the extremely limited robustness that UTF-8's significant redundancy buys is a good tradeoff. I think it's obvious that it isn't, and I posit that anybody who knows anything about error-correcting codes would agree with that assessment. You would be much better off with a more compact header-based transfer format, layering on whatever level of error correction you need at a different level, which, as I noted, is already done at the link and transport layers and in various other parts of the system.
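
As a rough sketch of what I mean by layering it on where you need it (my own toy framing, nothing standardized): take the compact payload and add an explicit checksum, which catches every single-bit flip instead of the occasional one UTF-8 happens to notice:

```d
import std.digest.crc : crc32Of;
import std.stdio : writeln;

// Frame a compact payload with an explicit 4-byte CRC32 trailer.
ubyte[] withChecksum(const(ubyte)[] payload)
{
    ubyte[4] crc = crc32Of(payload);
    return payload.dup ~ crc[];
}

// Check the trailer; any single-bit flip anywhere in the frame is caught.
bool verify(const(ubyte)[] framed)
{
    if (framed.length < 4) return false;
    ubyte[4] expected = crc32Of(framed[0 .. $ - 4]);
    return expected[] == framed[$ - 4 .. $];
}

void main()
{
    auto framed = withChecksum(cast(const(ubyte)[]) "compact header + payload");
    writeln(verify(framed));   // true
    framed[3] ^= 0x04;         // corrupt one bit
    writeln(verify(framed));   // false: detected, unlike most flips under UTF-8
}
```

That is the kind of check that already sits at the link and transport layers; there is no need to smear a much weaker version of it across the text encoding itself.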

If you need more error correction than that, do it right, not in the broken way UTF-8 does. Honestly, error detection/correction is the most laughably broken part of UTF-8; it is amazing that people even bring it up as a benefit.
