On Thursday, 17 May 2018 at 15:16:19 UTC, Joakim wrote:
On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
This is not practical, sorry. What happens when your message
loses the header? Exactly, the rest of the message is garbled.
Why would it lose the header? TCP guarantees delivery and
checksums the data, that's effective enough at the transport
layer.
What does TCP/IP have to do with anything under discussion here?
UTF-8 (or UTF-16 or UTF-32) has nothing to do with network
protocols. That's completely unrelated. A file encoded on a disk
may never leave the machine it is written on and may never see a
wire in its lifetime and its encoding is still of vital
importance. That's why a header encoding is too restrictive.
I agree that UTF-8 is a more redundant format, as others have
mentioned earlier, and is thus more robust to certain types of
data loss than a header-based scheme. However, I don't consider
that the job of the text format, it's better done by other
layers, like transport protocols or filesystems, which will
guard against such losses much more reliably and efficiently.
No. A text format cannot depend on a network protocol. It would
be as if you could only listen to a music file or watch a video
while streaming, and could never save it to an offline file,
because nowhere would there be any information about what that
blob of bytes represents. It doesn't make any sense.
For example, a random bitflip somewhere in the middle of a
UTF-8 string will not be detectable most of the time. However,
more robust error-correcting schemes at other layers of the
system will easily catch that.
That's the job of the other layers. Any other file format would
have the same problem. At least with UTF-8, at most one codepoint
will ever be lost or changed; any other encoding would fare
worse. That said, if a checksum for your document is important,
you can always add it externally anyway.
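The damage-containment claim above is easy to check. A minimal Python sketch (the sample string is my own, not from the thread): flip one bit inside a multi-byte sequence and observe that only that one character is lost, because the decoder resynchronizes at the very next lead or ASCII byte.

```python
# A single bit flip inside a UTF-8 multi-byte sequence corrupts at most
# one original character: every byte declares its own role, so the
# decoder resynchronizes at the next codepoint boundary.
text = "Décision de l'Autorité"
data = bytearray(text.encode("utf-8"))

data[1] ^= 0x40  # turn the lead byte of the first 'é' into a stray continuation byte

# Decoding with replacement confines the damage to the 'é'; everything
# after it (including the second 'é') survives intact.
repaired = data.decode("utf-8", errors="replace")
```

(Python may emit more than one U+FFFD replacement for the damaged sequence, but only the single original character is actually lost.)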
That's exactly what happened with code page based texts when
you don't know in which code page it is encoded. It has the
supplemental inconvenience that mixing languages becomes
impossible or at least very cumbersome.
UTF-8 has several properties that are difficult to have with
other schemes.
- It is state-less, means any byte in a stream always means
the same thing. Its meaning does not depend on external or a
previous byte.
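The statelessness property can be sketched concretely (my own illustration, with an assumed helper name `resync`): because every UTF-8 byte self-identifies as ASCII, lead, or continuation, a reader dropped at an arbitrary offset in a stream can find the next codepoint boundary by skipping at most three continuation bytes, with no header and no external state.

```python
# Every UTF-8 byte self-identifies: 0xxxxxxx is ASCII, 11xxxxxx starts a
# sequence, 10xxxxxx continues one. So resynchronization needs no state.
def resync(buf: bytes, pos: int) -> int:
    """Skip continuation bytes (10xxxxxx) until a codepoint boundary."""
    while pos < len(buf) and 0x80 <= buf[pos] < 0xC0:
        pos += 1
    return pos

data = "αβγ hello".encode("utf-8")
start = resync(data, 1)           # offset 1 lands mid-'α' (a continuation byte)
tail = data[start:].decode("utf-8")  # decodes cleanly from the next boundary
```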
I realize this was considered important at one time, but I
think it has proven to be a bad design decision, for HTTP too.
There are some advantages when building rudimentary systems
with crude hardware and lots of noise, as was the case back
then, but that's not the tech world we live in today. That's
why almost every HTTP request today is part of a stateful
session that explicitly keeps track of the connection, whether
through cookies, https encryption, or HTTP/2.
Again, this is orthogonal to UTF-8. When I speak of streams above,
I don't mean only sockets; files are also read as streams. So stop
equating UTF-8 with the Internet; these are two different domains.
The Internet and its protocols were defined and invented long
before Unicode, and Unicode is very useful offline as well.
- It can mix any language in the same stream without
acrobatics, and anyone who thinks that mixing languages doesn't
happen often should get his head extracted from his rear,
because it is very common (check Wikipedia's front page for
example).
I question that almost anybody needs to mix "streams." As for
messages or files, headers handle multiple language mixing
easily, as noted in that earlier thread.
Ok, show me how you transmit that, I'm curious:
<prop type="Txt::Doc. No.">E2010C0002</prop>
<tuv lang="EN-GB">
<seg>EFTA Surveillance Authority Decision</seg>
</tuv>
<tuv lang="DE-DE">
<seg>Beschluss der EFTA-Überwachungsbehörde</seg>
</tuv>
<tuv lang="DA-01">
<seg>EFTA-Tilsynsmyndighedens beslutning</seg>
</tuv>
<tuv lang="EL-01">
<seg>Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ</seg>
</tuv>
<tuv lang="ES-ES">
<seg>Decisión del Órgano de Vigilancia de la AELC</seg>
</tuv>
<tuv lang="FI-01">
<seg>EFTAn valvontaviranomaisen päätös</seg>
</tuv>
<tuv lang="FR-FR">
<seg>Décision de l'Autorité de surveillance AELE</seg>
</tuv>
<tuv lang="IT-IT">
<seg>Decisione dell’Autorità di vigilanza EFTA</seg>
</tuv>
<tuv lang="NL-NL">
<seg>Besluit van de Toezichthoudende Autoriteit van de EVA</seg>
</tuv>
<tuv lang="PT-PT">
<seg>Decisão do Órgão de Fiscalização da EFTA</seg>
</tuv>
<tuv lang="SV-SE">
<seg>Beslut av Eftas övervakningsmyndighet</seg>
</tuv>
<tuv lang="LV-01">
<seg>EBTA Uzraudzības iestādes Lēmums</seg>
</tuv>
<tuv lang="CS-01">
<seg>Rozhodnutí Kontrolního úřadu ESVO</seg>
</tuv>
<tuv lang="ET-01">
<seg>EFTA järelevalveameti otsus</seg>
</tuv>
<tuv lang="PL-01">
<seg>Decyzja Urzędu Nadzoru EFTA</seg>
</tuv>
<tuv lang="SL-01">
<seg>Odločba Nadzornega organa EFTE</seg>
</tuv>
<tuv lang="LT-01">
<seg>ELPA priežiūros institucijos sprendimas</seg>
</tuv>
<tuv lang="MT-01">
<seg>Deċiżjoni tal-Awtorità tas-Sorveljanza tal-EFTA</seg>
</tuv>
<tuv lang="SK-01">
<seg>Rozhodnutie Dozorného orgánu EZVO</seg>
</tuv>
<tuv lang="BG-01">
<seg>Решение на Надзорния орган на ЕАСТ</seg>
</tuv>
</tu>
<tu>
- The multi byte nature of other alphabets is not as bad as
people think because texts in computer do not live on their
own, meaning that they are generally embedded inside file
formats, which more often than not are extremely bloated (xml,
html, xliff, akoma ntoso, rtf etc.). The few bytes more in the
text do not weigh that much.
Heh, the other parts of the tech stack are much more bloated,
so this bloat is okay? A unique argument, but I'd argue that's
why those bloated formats you mention are largely dying off too.
They aren't; it's getting worse by the day. That's why I mentioned
Akoma Ntoso and XLIFF: they will be used more and more. The world
is not limited to webshit (see n-gate.com for the reference).
I'm in charge, at the European Commission, of the biggest
translation memory in the world. It currently handles 30
languages, and without UTF-8 and UTF-16 it would be
unmanageable. I still remember when I started there in 2002,
when we handled only 11 languages, of which only one used
another alphabet (Greek). Everything was based on RTF with
codepages and it was a braindead mess. My first job in 2003
was to extend the system to handle the 8 newcomer languages,
and with ASCII-based encodings it was completely unmanageable,
because every document processed mixes languages and alphabets
freely (addresses and names are often written in their
original form, for instance).
I have no idea what a "translation memory" is. I don't doubt
that dealing with non-standard codepages or layouts was
difficult, and that a standard like Unicode made your life
easier. But the question isn't whether standards would clean
things up, of course they would, the question is whether a
hypothetical header-based standard would be better than the
current continuation byte standard, UTF-8. I think your life
would've been even easier with the former, though depending on
your usage, maybe the main gain for you would be just from
standardization.
I doubt it, because the issue has nothing to do with network
protocols, as you seem to imply. It is about the data format,
i.e. content that may be shuffled over a network, but can also
stay on a disk, be printed on paper (gasp, such old tech), or
be used interactively in a GUI.
Two years ago we also implemented support for Chinese. The nice
thing was that we didn't have to change much to do it, thanks
to Unicode. The second surprise was the file sizes: Chinese
documents were generally smaller than their European
counterparts. Yes, CJK requires 3 bytes for each ideogram, but
generally one ideogram replaces many letters. The ideogram 亿
replaces "one hundred million", for example; which of the two
takes more bytes? So if CJK indeed requires more bytes to
encode, it is firstly because it NEEDS many more bits in the
first place (there are around 30000 CJK codepoints in the BMP
alone; add the 60000 that are in the SIP and we need 17 bits
just to encode them).
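The 亿 example can be checked in two lines of Python: the single ideogram costs 3 bytes in UTF-8, while the English phrase it replaces costs 19.

```python
# Byte-count check for the size argument above: one 3-byte CJK ideogram
# versus the 19-byte ASCII phrase it stands for.
ideogram = "亿"                    # U+4EBF, "one hundred million"
phrase = "One hundred million"
ideogram_bytes = len(ideogram.encode("utf-8"))  # 3 bytes
phrase_bytes = len(phrase.encode("utf-8"))      # 19 bytes
```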
That's not the relevant criteria: nobody cares if the CJK
documents were smaller than their European counterparts. What
they care about is that, given a different transfer format, the
CJK document could have been significantly smaller still.
Because almost nobody cares about which translation version is
smaller, they care that the text they sent in Chinese or Korean
is as small as it can be.
At most 50% more, but if size is really that important, one can
use UTF-16, which is the same size as Big5 or Shift-JIS; or, as
Walter suggested, it would be better to simply compress the file
in that case.
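The "at most 50% more" figure follows from BMP ideograms taking 3 bytes in UTF-8 versus 2 in UTF-16, which a short Python sketch confirms (the sample CJK string is my own, not from the thread):

```python
# For BMP-only CJK text, UTF-8 uses 3 bytes per character and UTF-16
# uses 2, so UTF-8 is exactly 50% larger -- the worst case cited above.
cjk = "日本語のテキスト"   # 8 BMP characters (sample text, assumed)
u8 = len(cjk.encode("utf-8"))      # 3 bytes each -> 24
u16 = len(cjk.encode("utf-16-le")) # 2 bytes each -> 16
ratio = u8 / u16                   # 1.5, i.e. 50% more
```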
Anyway, I didn't mean to restart this debate, so I'll leave it
here.
- the auto-synchronization and the statelessness are big deals.