utf-8, Latin 4, and basic unix commands

Trond Trosterud Tue, 17 Apr 2001 02:05:37 -0700
I am about to embark on a project that involves developing tools for Sami
in a Unix environment and storing corpora to be accessed on the web, using
ISO 646 plus the UCS characters 00C1, 00E1, 010C, 010D, 0110, 0111, 014A,
014B, 0160, 0161, 0166, 0167, 017D, 017E to render Sami. These are all
available in 8859-4. Thinking long-term, I would like to use UTF-8.
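
To make the comparison concrete, here is a minimal sketch, assuming Python 3
and its standard "iso8859_4" and "utf-8" codecs, that lists how each of the
characters above is encoded in both schemes. The names SAMI_CODE_POINTS and
encoding_table are made up for this illustration.

```python
import unicodedata

# The code points listed above (hex): the Sami letters beyond ISO 646.
SAMI_CODE_POINTS = [
    0x00C1, 0x00E1, 0x010C, 0x010D, 0x0110, 0x0111, 0x014A,
    0x014B, 0x0160, 0x0161, 0x0166, 0x0167, 0x017D, 0x017E,
]


def encoding_table(code_points):
    """Return (char, Unicode name, 8859-4 bytes, UTF-8 bytes) per code point."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append(
            (ch, unicodedata.name(ch), ch.encode("iso8859_4"), ch.encode("utf-8"))
        )
    return rows


if __name__ == "__main__":
    for ch, name, latin4, utf8 in encoding_table(SAMI_CODE_POINTS):
        print(f"U+{ord(ch):04X}  8859-4: {latin4.hex()}  UTF-8: {utf8.hex()}  {name}")
```

Each of these characters takes one byte in 8859-4 and two bytes in UTF-8,
which is the difference behind the questions below.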

If I read this list right, UTF-8 on Emacs is not quite there yet. There is
something called MULE, but people are waiting for the next Emacs version to
handle UTF-8 smoothly. I will not be able to do my work without Emacs. When
is the next, UTF-8-friendly version scheduled for release?

I have scanned through all the information I have found, but still do not
quite see how basic Unix commands such as wc, sort, etc. handle UTF-8. Some
of them do not need to know that some characters are single-byte and others
are not. But others (wc, etc.) do. I might have overlooked something
obvious, but I cannot find information on this. So: will some Unix flavours
cope with the issue better than others? For my purpose: can I buy a Mac OS X
machine and assume that the Unix in that box will behave the way other
Unixes do?
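
The kind of mismatch I have in mind can be sketched as follows, assuming
Python 3. The helper name byte_and_char_counts is hypothetical, and the
sketch only illustrates the difference between counting bytes and counting
characters; it says nothing about how any particular wc is implemented.

```python
def byte_and_char_counts(text, encoding="utf-8"):
    """Return (number of bytes, number of characters) for text in encoding."""
    data = text.encode(encoding)
    return len(data), len(text)


# "Sámegiella", written with an escape so the á is the precomposed U+00E1.
word = "S\u00e1megiella"

if __name__ == "__main__":
    print("UTF-8: ", byte_and_char_counts(word, "utf-8"))
    print("8859-4:", byte_and_char_counts(word, "iso8859_4"))
```

A tool that counts bytes will report one more unit for this word in UTF-8
than in 8859-4, while a tool that counts characters should give the same
answer for both.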

I also do not see info on how to make keyboard drivers for UTF-8. This
might be trivial, and therefore not mentioned, but I will need an input
device. Where do I find info on how to build a keyboard layout for the
character set just quoted?

What prompted me to write this was the recent thread discussing whether it
is possible to print UTF-8 encoded characters or not. This seems to me a
pretty basic demand. I have been working for a Unix-based language
technology company, and despite its being Unix-based and multilingual, our
technical personnel have decided not to migrate to UTF-8, due to open
questions of the type just presented. Perhaps that was not an unreasonable
decision after all?

In despair, I look for alternatives. One is to pick some 8-bit code table
while waiting for the UTF-8 issues to settle, at least to a certain point.
One candidate code table is 8859-4. What I find there is that the pipe
character ("|") of 8859-1 (xA6) is replaced by CAPITAL L WITH COMMA BELOW.
From a Unix point of view, and since I will not be doing Baltic languages,
this sounds like a fatal loss. Does this mean that Latin 4 cannot be used
for Unix purposes, or does it mean that I can just use it, but remember
that L WITH COMMA means "pipe this input to the next command" (and even
make a bastard font resembling Latin 4 but having the glyph "|" instead, in
order not to get too ugly commands)?
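
One way to check what actually sits at a given byte position is the sketch
below, assuming Python 3 and its standard "iso8859_1" and "iso8859_4"
codecs; the helper name describe is made up for the illustration. As far as
these codecs go, the shell's pipe character "|" is byte 0x7C, inside the
ISO 646 range that both code tables share, while 0xA6 is a different
character in each table.

```python
import unicodedata


def describe(byte_value, encoding):
    """Decode one byte in the given encoding; return (code point, Unicode name)."""
    ch = bytes([byte_value]).decode(encoding)
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


if __name__ == "__main__":
    for byte_value in (0x7C, 0xA6):
        for encoding in ("iso8859_1", "iso8859_4"):
            print(f"0x{byte_value:02X} in {encoding}:", describe(byte_value, encoding))
```

If this is right, 0xA6 is BROKEN BAR in 8859-1 and L WITH CEDILLA in 8859-4,
and the pipe at 0x7C is untouched by the choice between the two.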


Greetings,

-------------------------------------------------------------------
Trond Trosterud                                     t +47 7764 4763
Det humanistiske fakultet                           h +47 7767 3639
N-9037 Universitetet i Tromsø, Noreg                f +47 7764 4239
[EMAIL PROTECTED]           http://www.hum.uit.no/a/trond/
-------------------------------------------------------------------



-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/