On 30/12/2020 16:35, Andreas K. Huettel wrote:
I don't know if this has improved over the years, but my initial
experience with unicode was rather negative. The fact that text
files were twice as large wasn't a major problem in itself. The
real showstopper was that importing text files into spreadsheets
and text-editors and word processors failed miserably.
I looked at a unicode text file with a binary viewer. It turns out
that a simple text string like "1234" was actually...
"1" binary-zero "2" binary-zero "3" binary-zero "4" binary zero, etc.
That's (as someone has already pointed out) UTF-16, which is the default for
some Windows tools (but understood on Linux too). (Even UTF-32 exists, where
all characters are 4 bytes wide, but I've never seen it in the wild.)
UTF-8 is normally used on Linux (and ASCII chars look exactly the same there);
even "long" characters outside the ASCII range should no longer be a problem
for spreadsheets and word processors.
Following up on my previous answer, you need to separate in your mind
Unicode the character set, and UTF-x the representation. When Unicode was
introduced MS - in accordance with the thinking of the time - thought
the future was a 16-bit char, which can store 65 thousand characters.
(Note that in the original ISO 10646 design the code space was capped at
31 bits, so the high bit of a 32-bit character is BY DEFINITION zero -
just as the high bit of a standard ASCII byte is.)
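To make that split concrete: a code point is just a number assigned by the
character set, and the UTF-x encodings are different ways of writing that
number down as bytes. A small Python sketch (the euro sign U+20AC is an
arbitrary example character):

  char = "\u20ac"   # EURO SIGN, code point U+20AC
  print(f"code point: U+{ord(char):04X}")
  for codec in ("utf-8", "utf-16-le", "utf-32-le"):
      # Same character, same code point, three different byte layouts.
      print(f"{codec:10} -> {char.encode(codec).hex(' ')}")

  # Typical output:
  #   code point: U+20AC
  #   utf-8      -> e2 82 ac
  #   utf-16-le  -> ac 20
  #   utf-32-le  -> ac 20 00 00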
So MS and Windows use UTF-16 as their encoding. Unix LATER went down the
route of UTF-8, which can only encode about 2 thousand characters in two
bytes (the rest of the 16-bit range needs three); but because most (western)
text encodes successfully in a single byte it is actually a major saving in
network operations such as email, web etc, which is where Unix has
traditionally been very strong.
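To put rough numbers on that, a small Python sketch (the sample characters
are arbitrary picks from each range):

  samples = [
      ("A", "U+0041, ASCII"),
      ("\u00e9", "U+00E9, Latin-1 range"),
      ("\u20ac", "U+20AC, rest of the BMP"),
      ("\U0001f600", "U+1F600, outside the BMP"),
  ]
  for ch, note in samples:
      # len() of the encoded string is the number of bytes UTF-8 spends.
      print(f"{note:25} -> {len(ch.encode('utf-8'))} byte(s) in UTF-8")

  # Typical output:
  #   U+0041, ASCII             -> 1 byte(s) in UTF-8
  #   U+00E9, Latin-1 range     -> 2 byte(s) in UTF-8
  #   U+20AC, rest of the BMP   -> 3 byte(s) in UTF-8
  #   U+1F600, outside the BMP  -> 4 byte(s) in UTF-8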
But UTF-16 works very well for MS, because they are primarily desktop, and
in UTF-16 very few characters need more than one 16-bit unit. That reduces
pressure on the CPU, which is the limiting resource on a desktop.
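For what it's worth, the multi-unit case in UTF-16 only arises for characters
outside the 16-bit Basic Multilingual Plane, which then need a surrogate pair.
A small sketch (the emoji is just an arbitrary example of such a character):

  for ch in ("\u20ac", "\U0001f600"):   # EURO SIGN, then a character beyond U+FFFF
      data = ch.encode("utf-16-le")
      # Each UTF-16 code unit is 2 bytes, so unit count = byte count / 2.
      print(f"U+{ord(ch):04X} -> {len(data) // 2} UTF-16 unit(s): {data.hex(' ')}")

  # Typical output:
  #   U+20AC -> 1 UTF-16 unit(s): ac 20
  #   U+1F600 -> 2 UTF-16 unit(s): 3d d8 00 de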
And lastly, very importantly, given that AT PRESENT all characters fit in
31 bits (Unicode currently stops at U+10FFFF, which needs only 21), UTF-32
the representation is equivalent to Unicode the character set. But should we
ever need more than 2 billion characters, there is nothing stopping us rolling
out characters encoded in two 32-bit units, and a UTF-64.
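To illustrate that equivalence: in UTF-32 the encoded 32-bit unit is literally
the code point number. One last small sketch, again just in Python:

  import struct

  for ch in ("A", "\u20ac", "\U0001f600"):
      # Unpack the single little-endian 32-bit unit and compare it to the code point.
      (unit,) = struct.unpack("<I", ch.encode("utf-32-le"))
      print(f"U+{ord(ch):04X}: UTF-32 unit {unit:#x} == code point {ord(ch):#x} -> {unit == ord(ch)}")

  # Typical output:
  #   U+0041: UTF-32 unit 0x41 == code point 0x41 -> True
  #   U+20AC: UTF-32 unit 0x20ac == code point 0x20ac -> True
  #   U+1F600: UTF-32 unit 0x1f600 == code point 0x1f600 -> True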
Cheers,
Wol