On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
> This is not practical, sorry. What happens when your message loses the header? Exactly, the rest of the message is garbled.

Why would it lose the header? TCP guarantees delivery and checksums the data; that's effective enough at the transport layer.

I agree that UTF-8 is a more redundant format, as others have mentioned earlier, and is thus more robust to certain kinds of data loss than a header-based scheme. However, I don't consider that the job of the text format; it's better done by other layers, like transport protocols or filesystems, which guard against such losses far more reliably and efficiently.

For example, a random bitflip somewhere in the middle of a UTF-8 string will not be detectable most of the time. However, more robust error-correcting schemes at other layers of the system will easily catch that.
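
To make that concrete, here is a minimal Python sketch (the string and the flipped bit are arbitrary, purely for illustration): a single flipped bit that stays valid UTF-8, so the encoding itself never notices, while a checksum at another layer catches it immediately.

import zlib

original = "héllo wörld".encode("utf-8")
corrupted = bytearray(original)
corrupted[6] ^= 0x01               # flip one bit: the ASCII space 0x20 becomes '!' 0x21
corrupted = bytes(corrupted)

print(corrupted.decode("utf-8"))   # still decodes as perfectly valid UTF-8: "héllo!wörld"
print(zlib.crc32(original) == zlib.crc32(corrupted))   # False: the CRC sees the corruption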

> That's exactly what happened with code-page-based texts when you didn't know which code page they were encoded in. It has the additional inconvenience that mixing languages becomes impossible, or at least very cumbersome. UTF-8 has several properties that are difficult to get with other schemes.
>
> - It is stateless, meaning any byte in a stream always means the same thing. Its meaning does not depend on external state or on a previous byte.
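
That stateless, self-synchronizing property is real enough; a rough Python sketch (resync is just an illustrative helper, not anything standard):

data = "naïve 日本語 text".encode("utf-8")

def resync(buf, start):
    """Skip continuation bytes (0b10xxxxxx) until the next lead or ASCII byte."""
    i = start
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i

# Start reading at any offset: at worst one code point is lost, the rest decodes cleanly.
for offset in range(len(data)):
    data[resync(data, offset):].decode("utf-8")   # never raises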

I realize that statelessness was considered important at one time, but I think it has proven to be a bad design decision, for HTTP too. There are some advantages when building rudimentary systems with crude hardware and lots of noise, as was the case back then, but that's not the tech world we live in today. That's why almost every HTTP request today is part of a stateful session that explicitly keeps track of the connection, whether through cookies, HTTPS encryption, or HTTP/2.

> - It can mix any language in the same stream without acrobatics, and anyone who thinks that mixing languages doesn't happen often should get their head extracted from their rear, because it is very common (check Wikipedia's front page, for example).

I question whether almost anybody actually needs to mix "streams." As for messages or files, headers handle mixing multiple languages easily, as noted in that earlier thread.
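
To illustrate what I mean, here is a purely hypothetical layout sketched in Python just for this post (it is not any published spec, and the script ids are made-up placeholders): a small table up front tags each run of text with a script id and a byte length, and the payload stays one byte per character wherever the script allows it.

import struct

def pack(runs):
    # runs: list of (script_id, payload_bytes); the ids are invented for the example
    header = struct.pack("<H", len(runs))
    for script_id, payload in runs:
        header += struct.pack("<HI", script_id, len(payload))
    return header + b"".join(payload for _, payload in runs)

def unpack(blob):
    (count,) = struct.unpack_from("<H", blob, 0)
    offset, body, runs = 2, 2 + count * 6, []
    for _ in range(count):
        script_id, length = struct.unpack_from("<HI", blob, offset)
        offset += 6
        runs.append((script_id, blob[body:body + length]))
        body += length
    return runs

# A Latin-1 run followed by a Greek (ISO 8859-7) run, one byte per character each:
blob = pack([(1, "Bruxelles, café".encode("latin-1")),
             (7, "Αθήνα".encode("iso8859_7"))])
print(unpack(blob))

Whether something like this beats continuation bytes is exactly the question; the point here is only that the header is neither exotic nor hard to parse.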

> - The multi-byte nature of other alphabets is not as bad as people think, because text on a computer does not live on its own: it is generally embedded inside file formats, which more often than not are extremely bloated (XML, HTML, XLIFF, Akoma Ntoso, RTF, etc.). The few extra bytes in the text do not weigh that much.

Heh, the other parts of the tech stack are much more bloated, so this bloat is okay? A unique argument, but I'd argue that's why those bloated formats you mention are largely dying off too.

> I'm in charge of the biggest translation memory in the world, at the European Commission. It currently handles 30 languages, and without UTF-8 and UTF-16 it would be unmanageable. I still remember when I started there in 2002, when we handled only 11 languages, of which only one used another alphabet (Greek). Everything was based on RTF with code pages, and it was a braindead mess. My first job, in 2003, was to extend the system to handle the 8 newcomer languages, and with ASCII-based encodings it was completely unmanageable, because every document processed mixes languages and alphabets freely (addresses and names are often written in their original form, for instance).

I have no idea what a "translation memory" is. I don't doubt that dealing with non-standard code pages or layouts was difficult, and that a standard like Unicode made your life easier. But the question isn't whether standards would clean things up (of course they would); the question is whether a hypothetical header-based standard would be better than the current continuation-byte standard, UTF-8. I think your life would have been even easier with the former, though depending on your usage, maybe the main gain for you was simply standardization.

> Two years ago we also implemented support for Chinese. The nice thing was that, thanks to Unicode, we didn't have to change much to do it. The second surprise was the file sizes: Chinese documents were generally smaller than their European counterparts. Yes, CJK requires 3 bytes for each ideogram, but generally one ideogram replaces many letters. The ideogram 亿 replaces "one hundred million", for example; which of them takes more bytes? So if CJK indeed requires more bytes to encode, it is firstly because it NEEDS many more bits in the first place (there are around 30,000 CJK codepoints in the BMP alone; add to that the 60,000 in the SIP and we need 17 bits just to encode them).

That's not the relevant criterion: nobody cares whether the CJK documents were smaller than their European counterparts. What matters is that, given a different transfer format, the CJK document could have been significantly smaller still. Almost nobody cares which translation is smaller; they care that the text they send in Chinese or Korean is as small as it can be.
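
A rough size check in Python, for what it's worth (the sample sentence is arbitrary, and real documents obviously vary):

sample = "欧盟委员会负责翻译记忆库"   # 12 common ideograms
for codec in ("utf-8", "utf-16-le", "gb2312"):
    print(codec, len(sample.encode(codec)))
# utf-8     -> 36 bytes (3 per ideogram)
# utf-16-le -> 24 bytes (2 per ideogram, all BMP)
# gb2312    -> 24 bytes (a CJK-specific legacy code page, also 2 per ideogram)

print(len("亿".encode("utf-8")), len("One hundred million".encode("utf-8")))   # 3 vs 19
print((30000 + 60000).bit_length())   # 17: your bit arithmetic checks out

So UTF-16 or a dedicated code page already carries the same Chinese text in roughly two-thirds of the UTF-8 size; that, not Chinese-versus-French file sizes, is the comparison I mean.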

Anyway, I didn't mean to restart this debate, so I'll leave it here.
