Re: Comments on locale name guideline: CODESET names

Markus Kuhn Wed, 20 Jun 2001 09:22:12 -0700
On Wed, 20 Jun 2001, Pablo Saratxaga wrote:
> The standard Vietnamese encoding is TCVN-5712 not VISCII.

Standard? No. If it's still not even in the IANA registry, it can't be
that widely used. Also show me a terminal emulator that displays the
combining characters in TCVN-5712 correctly! The UTF-8 xterm and its
fonts are by now a couple of thousand times more widely deployed than
any TCVN-5712 capable terminal emulator, which makes UTF-8 by far the
best supported encoding for Vietnamese today. The Vietnamese locale
currently installed on all Linux systems is UTF-8 only. TCVN-5712 is
very dead on Linux and has already lost the race against UTF-8. There is
a little bit of use of VISCII though, so it deserves being mentioned,
which I did.

> >  "WINDOWS-1251", "WINDOWS-1256"

> or "CP1251" and "CP1256" ? (that seems the preferred way and current
> practise)

No. I find the idea very nice to stay *strictly* within MIME charset
names. That simplifies interfacing to various Internet protocols (most
notably MIME and HTTP) significantly. Nobody gains anything from adding
new names when there exists already a suitable managed namespace. Any
useful C library implementation will have to accept the MIME encoding
names anyway as aliases.
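
As a rough illustration of the alias point, here is a minimal sketch that
uses Python's codec registry as a stand-in for a C library's alias table.
That substitution is an assumption made purely for illustration, and the
helper name same_encoding is hypothetical; nothing here comes from a
particular libc. It only shows that the MIME name and the "CP" name can
resolve to one and the same encoding, so keeping the MIME name as the
canonical spelling costs nothing.

```python
import codecs


def same_encoding(name_a: str, name_b: str) -> bool:
    """Return True if both charset names resolve to the same codec.

    codecs.lookup() normalizes case and punctuation and follows the
    registry's alias table, so the canonical .name can be compared.
    """
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


if __name__ == "__main__":
    # MIME charset name versus the "CP" spelling suggested in the thread.
    for mime_name, cp_name in [("WINDOWS-1251", "CP1251"),
                               ("WINDOWS-1256", "CP1256")]:
        print(mime_name, cp_name, same_encoding(mime_name, cp_name))
```

With such a table in place, a locale can be named with the MIME charset
name while older spellings keep working as aliases.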

I prefer sound engineering arguments over vague ideas about what might
or might not be preferred or current practice. Having the exact same
name as in MIME for encodings simplifies lots of things, and that is a
GOOD THING[TM].

Thomas Chan <[EMAIL PROTECTED]>:
> These two criteria effectively banish Traditional Chinese in zh_TW,
> which uses either the de facto industry standard Big5, or less
> preferably, EUC-TW.

I have nothing in principle against EUC-TW, but if it is not widely
used, then time spent on encouraging people to use EUC-TW is far better
spent encouraging them to use UTF-8 IMHO. That's one reason why I
didn't include EUC-TW. The other reason is that EUC-TW seems to be the
only one of the EUC standards that does not conform to ISO 2022. At
least, I really couldn't figure out the ISO 2022 designator sequences
for EUC-TW.
For all other EUC versions, they are listed in

  http://www.cl.cam.ac.uk/~mgk25/ucs/setcode                               

But that's more a side note, nothing essential. Had EUC-TW been in the
MIME registry, I definitely would have added it, and that could still be
fixed.

> Big5-HKSCS

Please show me the many good Big5-HKSCS fonts that the supposedly
numerous Big5-HKSCS users use on POSIX systems! Not even MULE seems to
come with such fonts, so it can't be that critical and widely used.

Roozbeh Pournader <[EMAIL PROTECTED]> wrote:
> Arabic (language) people rarely use visual charsets.

So what exactly are Arabic users of POSIX systems using in file names,
source code comments, email, web pages, etc.? What are the current
setups and how many people do you think use them? Is there any widely
used Arabic encoding that should be supported in locales in addition to
UTF-8? What Linux terminal emulator and fonts are currently used for
these Arabic locale definitions?

If there isn't a currently widely used Arabic terminal emulator (like
kterm for the CJK community, which is very widely used), then the answer
is probably that the Arabic script is not really widely used on Linux at
the moment, and we can start supporting it from scratch properly with
UTF-8 (see Robert Brady's work on Arabic xterm).

Marco Cimarosti <[EMAIL PROTECTED]>:
> In some cases, the only thing needed to implement them is the possibility
> that glyphs  be "zero width", with the actual shape gutting on the right
> side (or on the left side, in RTL scripts).

Such hacks sometimes work for simple text string display, but they fail
on character-cell terminal emulators, which get confused about the
cursor position after a BDF zero-width combining character and
explicitly have to know about the combining character. I'd advocate
worrying about these mechanics only for UTF-8. At present there is no
widely deployed combining-character terminal emulator support for any
other encoding.
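
To make the cursor-position problem concrete, here is a minimal sketch of
the bookkeeping a character-cell terminal has to do. It is a simplified
illustration, not the algorithm of any particular terminal emulator: the
function name cell_width is hypothetical, and the rule used (zero cells
for combining marks, two for East Asian wide or fullwidth characters, one
otherwise) is an assumption that ignores control characters and other
special cases.

```python
import unicodedata


def cell_width(text: str) -> int:
    """Estimate how many terminal cells a string occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn over the previous cell and
            # must not advance the cursor.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two cells
        else:
            width += 1
    return width


if __name__ == "__main__":
    # "Viet" with circumflex and dot below, fully decomposed (NFD).
    decomposed = unicodedata.normalize("NFD", "Vi\u1ec7t")
    print(len(decomposed), cell_width(decomposed))
```

The number of code points and the number of cells differ for the
decomposed string, which is exactly what a font-level zero-width glyph
cannot tell the terminal: the emulator itself has to know which
characters are combining.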

If someone (Tomohiro) wants to cast doubt on my technical opinion with
generic claims about my lack of knowledge of apparently widely deployed
implementations and usage, then a long list of URLs with evidence
(RPMs of support software, HOWTOs, etc.) should be easy to come by, and
is the least I can expect in support of the contrary of what I claimed.

Tomohiro KUBOTA <[EMAIL PROTECTED]>:
> New encodings may be developed in future

At the moment, I believe everything points towards ISO 10646
replacing all others within the next 10 years in >95% of all
installations. Should this change, we can always reconsider and add new
encodings to the small list. At the moment, there is no compelling
reason to expect any new encoding to become dominant. There can be
little doubt that a single global unified character encoding will be a
big blessing for all users and vendors equally.

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/