On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
> This is not practical, sorry. What happens when your message loses the header? Exactly, the rest of the message is garbled.

Why would it lose the header? TCP guarantees delivery and checksums the data; that's effective enough at the transport layer.

I agree that UTF-8 is a more redundant format, as others have mentioned earlier, and is thus more robust to certain kinds of data loss than a header-based scheme. However, I don't consider that the job of the text format; it's better done by other layers, like transport protocols or filesystems, which guard against such losses far more reliably and efficiently.

For example, a random bitflip somewhere in the middle of a UTF-8 string will not be detectable most of the time. However, more robust error-correcting schemes at other layers of the system will easily catch that.
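
To make that concrete, here is a minimal Python sketch (the string and the flipped bit are arbitrary, purely for illustration): a single flipped bit that stays valid UTF-8, so the encoding itself never notices, while a checksum at another layer catches it immediately.

import zlib

original = "héllo wörld".encode("utf-8")
corrupted = bytearray(original)
corrupted[6] ^= 0x01               # flip one bit: the ASCII space 0x20 becomes '!' 0x21
corrupted = bytes(corrupted)

print(corrupted.decode("utf-8"))   # still decodes as perfectly valid UTF-8: "héllo!wörld"
print(zlib.crc32(original) == zlib.crc32(corrupted))   # False: the CRC sees the corruption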

> That's exactly what happened with code-page-based texts when you didn't know which code page they were encoded in. It has the additional inconvenience that mixing languages becomes impossible, or at least very cumbersome. UTF-8 has several properties that are difficult to get with other schemes.
>
> - It is stateless, meaning any byte in a stream always means the same thing. Its meaning does not depend on external state or on a previous byte.
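
That stateless, self-synchronizing property is real enough; a rough Python sketch (resync is just an illustrative helper, not anything standard):

data = "naïve 日本語 text".encode("utf-8")

def resync(buf, start):
    """Skip continuation bytes (0b10xxxxxx) until the next lead or ASCII byte."""
    i = start
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i

# Start reading at any offset: at worst one code point is lost, the rest decodes cleanly.
for offset in range(len(data)):
    data[resync(data, offset):].decode("utf-8")   # never raises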

I realize that statelessness was considered important at one time, but I think it has proven to be a bad design decision, for HTTP too. There are some advantages when building rudimentary systems with crude hardware and lots of noise, as was the case back then, but that's not the tech world we live in today. That's why almost every HTTP request today is part of a stateful session that explicitly keeps track of the connection, whether through cookies, HTTPS encryption, or HTTP/2.

> - It can mix any language in the same stream without acrobatics, and anyone who thinks that mixing languages doesn't happen often should get their head extracted from their rear, because it is very common (check Wikipedia's front page, for example).

I question whether almost anybody actually needs to mix "streams." As for messages or files, headers handle mixing multiple languages easily, as noted in that earlier thread.
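
To illustrate what I mean, here is a purely hypothetical layout sketched in Python just for this post (it is not any published spec, and the script ids are made-up placeholders): a small table up front tags each run of text with a script id and a byte length, and the payload stays one byte per character wherever the script allows it.

import struct

def pack(runs):
    # runs: list of (script_id, payload_bytes); the ids are invented for the example
    header = struct.pack("<H", len(runs))
    for script_id, payload in runs:
        header += struct.pack("<HI", script_id, len(payload))
    return header + b"".join(payload for _, payload in runs)

def unpack(blob):
    (count,) = struct.unpack_from("<H", blob, 0)
    offset, body, runs = 2, 2 + count * 6, []
    for _ in range(count):
        script_id, length = struct.unpack_from("<HI", blob, offset)
        offset += 6
        runs.append((script_id, blob[body:body + length]))
        body += length
    return runs

# A Latin-1 run followed by a Greek (ISO 8859-7) run, one byte per character each:
blob = pack([(1, "Bruxelles, café".encode("latin-1")),
             (7, "Αθήνα".encode("iso8859_7"))])
print(unpack(blob))

Whether something like this beats continuation bytes is exactly the question; the point here is only that the header is neither exotic nor hard to parse.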

> - The multi-byte nature of other alphabets is not as bad as people think, because text on a computer does not live on its own: it is generally embedded inside file formats, which more often than not are extremely bloated (XML, HTML, XLIFF, Akoma Ntoso, RTF, etc.). The few extra bytes in the text do not weigh that much.

Heh, the other parts of the tech stack are much more bloated, so this bloat is okay? A unique argument, but I'd argue that's why those bloated formats you mention are largely dying off too.

> I'm in charge of the biggest translation memory in the world, at the European Commission. It currently handles 30 languages, and without UTF-8 and UTF-16 it would be unmanageable. I still remember when I started there in 2002, when we handled only 11 languages, of which only one used another alphabet (Greek). Everything was based on RTF with code pages, and it was a braindead mess. My first job, in 2003, was to extend the system to handle the 8 newcomer languages, and with ASCII-based encodings it was completely unmanageable, because every document processed mixes languages and alphabets freely (addresses and names are often written in their original form, for instance).

I have no idea what a "translation memory" is. I don't doubt that dealing with non-standard code pages or layouts was difficult, and that a standard like Unicode made your life easier. But the question isn't whether standards would clean things up (of course they would); the question is whether a hypothetical header-based standard would be better than the current continuation-byte standard, UTF-8. I think your life would have been even easier with the former, though depending on your usage, maybe the main gain for you was simply standardization.

> Two years ago we also implemented support for Chinese. The nice thing was that, thanks to Unicode, we didn't have to change much to do it. The second surprise was the file sizes: Chinese documents were generally smaller than their European counterparts. Yes, CJK requires 3 bytes for each ideogram, but generally one ideogram replaces many letters. The ideogram 亿 replaces "one hundred million", for example; which of them takes more bytes? So if CJK indeed requires more bytes to encode, it is firstly because it NEEDS many more bits in the first place (there are around 30,000 CJK codepoints in the BMP alone; add to that the 60,000 in the SIP and we need 17 bits just to encode them).

That's not the relevant criterion: nobody cares whether the CJK documents were smaller than their European counterparts. What matters is that, given a different transfer format, the CJK document could have been significantly smaller still. Almost nobody cares which translation is smaller; they care that the text they send in Chinese or Korean is as small as it can be.
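
A rough size check in Python, for what it's worth (the sample sentence is arbitrary, and real documents obviously vary):

sample = "欧盟委员会负责翻译记忆库"   # 12 common ideograms
for codec in ("utf-8", "utf-16-le", "gb2312"):
    print(codec, len(sample.encode(codec)))
# utf-8     -> 36 bytes (3 per ideogram)
# utf-16-le -> 24 bytes (2 per ideogram, all BMP)
# gb2312    -> 24 bytes (a CJK-specific legacy code page, also 2 per ideogram)

print(len("亿".encode("utf-8")), len("One hundred million".encode("utf-8")))   # 3 vs 19
print((30000 + 60000).bit_length())   # 17: your bit arithmetic checks out

So UTF-16 or a dedicated code page already carries the same Chinese text in roughly two-thirds of the UTF-8 size; that, not Chinese-versus-French file sizes, is the comparison I mean.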

Anyway, I didn't mean to restart this debate, so I'll leave it here.
