Re: utf-8, Latin 4, and basic unix commands

Markus Kuhn Tue, 17 Apr 2001 02:48:58 -0700
Trond Trosterud wrote on 2001-04-17 09:27 UTC:
> I have scanned through all info I have found, but still do not quite see
> how basic unix commands such as wc, sort, etc. handle UTF-8.

We just got support for UTF-8 locales into the GNU C library (glibc 2.2)
a few months ago, which was one important prerequisite for adding UTF-8
support to application programs. Support for applications will now
follow based on feedback from users who found in experiments that
something doesn't work yet with the very latest releases.

> So: Will some unix-flavours cope with the issue better than others?

It is my impression that Linux is at the moment by far the leading
POSIX operating system with regard to UTF-8 support (if we don't count
Thompson's and Ritchie's Plan 9 operating system of course, which
switched over to UTF-8 as its only character encoding almost 10 years
ago), followed by Solaris and AIX. I don't know about Mac OS X.

> I also do not see info on how to make keyboard drivers for UTF-8.

For Sami language support, this is simple. Just use xmodmap and remap
your keysyms as required onto your keys. Xterm and other UTF-8 software
will understand all the standard keysyms, plus you can generate a
keysym for any Unicode character by adding 0x01000000 to its code
position. Input methods are indeed a serious problem at the moment for
languages with character repertoires much larger than the number of keys
on a keyboard.
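
The keysym arithmetic can be sketched in a few lines of Python. This
is only an illustration of the "add 0x01000000" rule above; the helper
name unicode_keysym is made up for this example and is not part of
xmodmap or Xlib. Characters that already have a standard keysym would
normally use that one instead.

```python
# Sketch of the "add 0x01000000 to the code position" rule.
# unicode_keysym is a hypothetical helper, not an X11 API.
UNICODE_KEYSYM_OFFSET = 0x01000000


def unicode_keysym(char):
    """Return the Unicode-derived keysym value for a single character."""
    code_position = ord(char)
    return UNICODE_KEYSYM_OFFSET + code_position


# Two letters used in Sami orthographies: U+014B (eng) and U+0167 (t with stroke).
for ch in "\u014b\u0167":
    print("U+%04X -> keysym 0x%08x" % (ord(ch), unicode_keysym(ch)))
```

The resulting values are what you would then bind to keys with
xmodmap; check the manual page for the exact expression syntax.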

> Where do I find info on how to build a keyboard layout for the
> character set just quoted?

Just man xmodmap

> What prompted me to write this was the recent thread discussing whether it
> is possible to print UTF-8 encoded characters or not. This seems to me a
> pretty basic demand. I have been working for a unix based lg tech company,
> and despite its being unix-based and multilingual, our technical personnel
> has decided not to migrate to UTF-8, due to open questions of the type just
> presented. Perhaps that was not an unreasonable decision after all?

UTF-8 support is still at an early stage. There are now a couple of
handfuls of enthusiasts and developers who already run their entire
Linux system completely in UTF-8, but this still causes glitches with a
number of applications (which you can work around if you are
experienced). UTF-8 support in Emacs is provisional only, and we still
lack decent UTF-8 support in major email clients and
plaintext->postscript formatters. UTF-8 is now quite well supported by
various infrastructure blocks such as glibc, gettext, xterm, and
related packages. There are first versions of UTF-8 support available
in the latest versions of Perl, Tcl/Tk, and Python, but the developers
will not hesitate to agree that this is still more experimental than
mature.

> In despair, I look for alternatives. One is to pick some 8-bit codetable
> while waiting for UTF-8 issues to settle at least to a certain point.

Unless you work with a carefully selected set of tools and know exactly
what you are doing, ISO 8859-4 might at this time still be better for
doing real work than UTF-8. We'd appreciate it, however, if you tried
UTF-8 from time to time and asked here about any sort of problem that
you encountered. If we receive specific complaints about what does not
yet work, this might well speed things up. Converting between ISO 8859-4
and UTF-8 is trivial with iconv, as long as you restrict yourself to the
ISO 8859-4 repertoire and don't use the wider range of typographic
characters that UTF-8 provides (directional quotation marks, etc.).
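
To illustrate the repertoire restriction, here is a rough Python
sketch of the same conversion. It uses Python's built-in "iso8859_4"
codec rather than iconv, and the function names are invented for this
example. The round trip is lossless for Latin 4 text, while a
directional quotation mark has no ISO 8859-4 equivalent and is
rejected.

```python
# Sketch only: Python's codecs stand in for iconv here, and the helper
# names are hypothetical.

def latin4_to_utf8(data):
    """Re-encode ISO 8859-4 bytes as UTF-8 bytes."""
    return data.decode("iso8859_4").encode("utf-8")


def utf8_to_latin4(data):
    """Re-encode UTF-8 bytes as ISO 8859-4; fails outside the repertoire."""
    return data.decode("utf-8").encode("iso8859_4")


# Letters from the ISO 8859-4 repertoire: c-caron, a-acute, eng, t-stroke.
sample = "\u010d\u00e1\u014b\u0167".encode("iso8859_4")
as_utf8 = latin4_to_utf8(sample)
round_trip = utf8_to_latin4(as_utf8)

# U+201C LEFT DOUBLE QUOTATION MARK is not in ISO 8859-4.
try:
    utf8_to_latin4("\u201c".encode("utf-8"))
    quote_fits = True
except UnicodeEncodeError:
    quote_fits = False

print(round_trip == sample, quote_fits)
```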

> One candidate codetable is 8859-4. What I find there is that the pipe character
> ("|") of 8859-1 (xA6) is replaced by CAPITAL L WITH COMMA BELOW.

No problem here: you merely mixed up the ASCII pipe character

  | U+007C  VERTICAL LINE

and the completely useless EBCDIC relic

  ¦ U+00A6  BROKEN BAR

ISO 8859-1 GR characters (0xA0-0xFF) are usually not used with special
semantics in Unix. All ISO 8859 sets are ASCII supersets, so the pipe
character stays at 0x7C in ISO 8859-4 as well.
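
If you want to convince yourself, a small Python check along these
lines (a sketch using Python's built-in codecs, nothing Unix-specific)
shows that byte 0x7C is the pipe in both sets, and that only byte 0xA6
differs:

```python
import unicodedata

# Byte 0x7C is in the ASCII half, which every ISO 8859 part shares.
pipe_latin1 = b"\x7c".decode("iso8859_1")
pipe_latin4 = b"\x7c".decode("iso8859_4")

# Byte 0xA6 is in the upper (GR) half, where the parts differ.
a6_latin1 = b"\xa6".decode("iso8859_1")
a6_latin4 = b"\xa6".decode("iso8859_4")

# The whole ASCII range decodes identically under ISO 8859-4.
ascii_bytes = bytes(range(128))
ascii_preserved = ascii_bytes.decode("iso8859_4") == ascii_bytes.decode("ascii")

for label, ch in [("0x7C in 8859-1", pipe_latin1),
                  ("0x7C in 8859-4", pipe_latin4),
                  ("0xA6 in 8859-1", a6_latin1),
                  ("0xA6 in 8859-4", a6_latin4)]:
    print("%s: U+%04X %s" % (label, ord(ch), unicodedata.name(ch)))
```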

> From a
> unix point of view, and since I will not be doing Baltic lgs, this sounds
> like a fatal loss. Does this mean that Latin 4 cannot be used for unix
> purposes, or does it mean that I can just use it, but remember that L WITH
> COMMA means "pipe this input to the next command" (and even make a bastard
> font resembling Latin 4 but having the glyph "|" instead, in order not to
> get too ugly commands?)

No, no, and no.

Recommended reading:

  http://www.cl.cam.ac.uk/~mgk25/unicode.html

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/