On Thursday, 17 May 2018 at 15:16:19 UTC, Joakim wrote:
On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
This is not practical, sorry. What happens when your message
loses the header? Exactly, the rest of the message is garbled.
Why would it lose the header? TCP guarantees delivery and
checksums the data, that's effective enough at the transport
layer.
What does TCP/IP have to do with anything under discussion here?
UTF-8 (or UTF-16 or UTF-32) has nothing to do with network
protocols. That's completely unrelated. A file encoded on a disk
may never leave the machine it is written on and may never see a
wire in its lifetime and its encoding is still of vital
importance. That's why a header encoding is too restrictive.
I agree that UTF-8 is a more redundant format, as others have
mentioned earlier, and is thus more robust to certain types of
data loss than a header-based scheme. However, I don't consider
that the job of the text format, it's better done by other
layers, like transport protocols or filesystems, which will
guard against such losses much more reliably and efficiently.
No. A text format cannot depend on a network protocol. It would
be as if you could only listen to a music file or watch a video
while streaming, and could never save it to an offline file,
because nowhere would there be any information about what that
blob of bytes represents. It doesn't make any sense.
For example, a random bitflip somewhere in the middle of a
UTF-8 string will not be detectable most of the time. However,
more robust error-correcting schemes at other layers of the
system will easily catch that.
That's the job of the other layers. Any other file format would
have the same problem. At least with UTF-8, at most one codepoint
will ever be lost or changed; any other encoding would fare
worse. That said, if a checksum for your document is important,
you can always add it externally anyway.
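The damage-containment claim above is easy to check. A minimal Python sketch (the sample string is my own, not from the thread): flip one bit inside a multi-byte sequence and observe that only that one character is lost, because the decoder resynchronizes at the very next lead or ASCII byte.

```python
# A single bit flip inside a UTF-8 multi-byte sequence corrupts at most
# one original character: every byte declares its own role, so the
# decoder resynchronizes at the next codepoint boundary.
text = "Décision de l'Autorité"
data = bytearray(text.encode("utf-8"))

data[1] ^= 0x40  # turn the lead byte of the first 'é' into a stray continuation byte

# Decoding with replacement confines the damage to the 'é'; everything
# after it (including the second 'é') survives intact.
repaired = data.decode("utf-8", errors="replace")
```

(Python may emit more than one U+FFFD replacement for the damaged sequence, but only the single original character is actually lost.)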
That's exactly what happened with code page based texts when
you don't know in which code page it is encoded. It has the
supplemental inconvenience that mixing languages becomes
impossible or at least very cumbersome.
UTF-8 has several properties that are difficult to have with
other schemes.
- It is state-less, means any byte in a stream always means
the same thing. Its meaning does not depend on external or a
previous byte.
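The statelessness property can be sketched concretely (my own illustration, with an assumed helper name `resync`): because every UTF-8 byte self-identifies as ASCII, lead, or continuation, a reader dropped at an arbitrary offset in a stream can find the next codepoint boundary by skipping at most three continuation bytes, with no header and no external state.

```python
# Every UTF-8 byte self-identifies: 0xxxxxxx is ASCII, 11xxxxxx starts a
# sequence, 10xxxxxx continues one. So resynchronization needs no state.
def resync(buf: bytes, pos: int) -> int:
    """Skip continuation bytes (10xxxxxx) until a codepoint boundary."""
    while pos < len(buf) and 0x80 <= buf[pos] < 0xC0:
        pos += 1
    return pos

data = "αβγ hello".encode("utf-8")
start = resync(data, 1)           # offset 1 lands mid-'α' (a continuation byte)
tail = data[start:].decode("utf-8")  # decodes cleanly from the next boundary
```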
I realize this was considered important at one time, but I
think it has proven to be a bad design decision, for HTTP too.
There are some advantages when building rudimentary systems
with crude hardware and lots of noise, as was the case back
then, but that's not the tech world we live in today. That's
why almost every HTTP request today is part of a stateful
session that explicitly keeps track of the connection, whether
through cookies, https encryption, or HTTP/2.
Again, this is orthogonal to UTF-8. When I speak of streams above,
I don't mean only sockets; files are also read as streams. So stop
equating UTF-8 with the Internet; these are two different domains.
The Internet and its protocols were defined and invented long
before Unicode, and Unicode is very useful offline as well.
- It can mix any language in the same stream without
acrobatics, and anyone who thinks that mixing languages doesn't
happen often should get his head extracted from his rear,
because it is very common (check Wikipedia's front page for
example).
I question that almost anybody needs to mix "streams." As for
messages or files, headers handle multiple language mixing
easily, as noted in that earlier thread.
Ok, show me how you transmit that, I'm curious:
<prop type="Txt::Doc. No.">E2010C0002</prop>
<tuv lang="EN-GB">
<seg>EFTA Surveillance Authority Decision</seg>
</tuv>
<tuv lang="DE-DE">
<seg>Beschluss der EFTA-Überwachungsbehörde</seg>
</tuv>
<tuv lang="DA-01">
<seg>EFTA-Tilsynsmyndighedens beslutning</seg>
</tuv>
<tuv lang="EL-01">
<seg>Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ</seg>
</tuv>
<tuv lang="ES-ES">
<seg>Decisión del Órgano de Vigilancia de la AELC</seg>
</tuv>
<tuv lang="FI-01">
<seg>EFTAn valvontaviranomaisen päätös</seg>
</tuv>
<tuv lang="FR-FR">
<seg>Décision de l'Autorité de surveillance AELE</seg>
</tuv>
<tuv lang="IT-IT">
<seg>Decisione dell’Autorità di vigilanza EFTA</seg>
</tuv>
<tuv lang="NL-NL">
<seg>Besluit van de Toezichthoudende Autoriteit van de EVA</seg>
</tuv>
<tuv lang="PT-PT">
<seg>Decisão do Órgão de Fiscalização da EFTA</seg>
</tuv>
<tuv lang="SV-SE">
<seg>Beslut av Eftas övervakningsmyndighet</seg>
</tuv>
<tuv lang="LV-01">
<seg>EBTA Uzraudzības iestādes Lēmums</seg>
</tuv>
<tuv lang="CS-01">
<seg>Rozhodnutí Kontrolního úřadu ESVO</seg>
</tuv>
<tuv lang="ET-01">
<seg>EFTA järelevalveameti otsus</seg>
</tuv>
<tuv lang="PL-01">
<seg>Decyzja Urzędu Nadzoru EFTA</seg>
</tuv>
<tuv lang="SL-01">
<seg>Odločba Nadzornega organa EFTE</seg>
</tuv>
<tuv lang="LT-01">
<seg>ELPA priežiūros institucijos sprendimas</seg>
</tuv>
<tuv lang="MT-01">
<seg>Deċiżjoni tal-Awtorità tas-Sorveljanza tal-EFTA</seg>
</tuv>
<tuv lang="SK-01">
<seg>Rozhodnutie Dozorného orgánu EZVO</seg>
</tuv>
<tuv lang="BG-01">
<seg>Решение на Надзорния орган на ЕАСТ</seg>
</tuv>
</tu>
<tu>
- The multi byte nature of other alphabets is not as bad as
people think because texts in computer do not live on their
own, meaning that they are generally embedded inside file
formats, which more often than not are extremely bloated (xml,
html, xliff, akoma ntoso, rtf etc.). The few bytes more in the
text do not weigh that much.
Heh, the other parts of the tech stack are much more bloated,
so this bloat is okay? A unique argument, but I'd argue that's
why those bloated formats you mention are largely dying off too.
They aren't; it's getting worse by the day. That's why I mentioned
Akoma Ntoso and XLIFF: they will be used more and more. The world
is not limited to webshit (see n-gate.com for the reference).
I'm in charge, at the European Commission, of the biggest
translation memory in the world. It currently handles 30
languages, and without UTF-8 and UTF-16 it would be
unmanageable. I still remember when I started there in 2002,
when we handled only 11 languages, of which only one used
another alphabet (Greek). Everything was based on RTF with
codepages and it was a braindead mess. My first job in 2003
was to extend the system to handle the 8 newcomer languages,
and with ASCII-based encodings it was completely unmanageable,
because every document processed mixes languages and alphabets
freely (addresses and names are often written in their
original form, for instance).
I have no idea what a "translation memory" is. I don't doubt
that dealing with non-standard codepages or layouts was
difficult, and that a standard like Unicode made your life
easier. But the question isn't whether standards would clean
things up, of course they would, the question is whether a
hypothetical header-based standard would be better than the
current continuation byte standard, UTF-8. I think your life
would've been even easier with the former, though depending on
your usage, maybe the main gain for you would be just from
standardization.
I doubt it, because the issue has nothing to do with network
protocols, as you seem to imply. It is about the data format,
i.e. content that may be shuffled over a network, but can also
stay on a disk, be printed on paper (gasp, such old tech), or
be used interactively in a GUI.
Two years ago we also implemented support for Chinese. The nice
thing was that we didn't have to change much to do it, thanks
to Unicode. The second surprise was the file sizes: Chinese
documents were generally smaller than their European
counterparts. Yes, CJK requires 3 bytes for each ideogram, but
generally one ideogram replaces many letters. The ideogram 亿
replaces "one hundred million", for example; which of the two
takes more bytes? So if CJK indeed requires more bytes to
encode, it is firstly because it NEEDS many more bits in the
first place (there are around 30000 CJK codepoints in the BMP
alone; add the 60000 that are in the SIP and we need 17 bits
just to encode them).
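The 亿 example can be checked in two lines of Python: the single ideogram costs 3 bytes in UTF-8, while the English phrase it replaces costs 19.

```python
# Byte-count check for the size argument above: one 3-byte CJK ideogram
# versus the 19-byte ASCII phrase it stands for.
ideogram = "亿"                    # U+4EBF, "one hundred million"
phrase = "One hundred million"
ideogram_bytes = len(ideogram.encode("utf-8"))  # 3 bytes
phrase_bytes = len(phrase.encode("utf-8"))      # 19 bytes
```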
That's not the relevant criteria: nobody cares if the CJK
documents were smaller than their European counterparts. What
they care about is that, given a different transfer format, the
CJK document could have been significantly smaller still.
Because almost nobody cares about which translation version is
smaller, they care that the text they sent in Chinese or Korean
is as small as it can be.
At most 50% more, but if size is really that important, one can
use UTF-16, which is the same size as Big5 or Shift-JIS; or, as
Walter suggested, it would be better to simply compress the file
in that case.
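The "at most 50% more" figure follows from BMP ideograms taking 3 bytes in UTF-8 versus 2 in UTF-16, which a short Python sketch confirms (the sample CJK string is my own, not from the thread):

```python
# For BMP-only CJK text, UTF-8 uses 3 bytes per character and UTF-16
# uses 2, so UTF-8 is exactly 50% larger -- the worst case cited above.
cjk = "日本語のテキスト"   # 8 BMP characters (sample text, assumed)
u8 = len(cjk.encode("utf-8"))      # 3 bytes each -> 24
u16 = len(cjk.encode("utf-16-le")) # 2 bytes each -> 16
ratio = u8 / u16                   # 1.5, i.e. 50% more
```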
Anyway, I didn't mean to restart this debate, so I'll leave it
here.
- the auto-synchronization and the statelessness are big deals.