Kaixo!

On Wed, Jun 20, 2001 at 05:44:44PM +0100, Markus Kuhn wrote:
> On Wed, 20 Jun 2001, Pablo Saratxaga wrote:
> > The standard Vietnamese encoding is TCVN-5712 not VISCII.
> 
> Standard? No. If it's still not even in the IANA registry, it can't be
> that widely used.

There was a time when KOI8-R wasn't registered either, and yet it was
the standard in Russia, even though some documents claimed to be in
iso-8859-5.

I don't know why it isn't in the IANA registry, but that doesn't
change the facts.

> Also show me a terminal emulator that displays the
> combining characters in TCVN-5712 correctly!

Show me a TCVN-encoded text that actually uses combining characters
first :) There is no need to use them.

> The UTF-8 xterm and its
> fonts are by now a couple of thousand times more widely deployed than
> any TCVN-5712 capable terminal emulator,

No need for a special terminal; having the right font is enough.
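For instance, something like this should work (a hypothetical
invocation; the "tcvn-5712.1" registry string is an assumption, check
with xlsfonts what your TCVN fonts really report):

  xterm -fn '-*-fixed-medium-r-normal--13-*-*-*-*-*-tcvn-5712.1'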

> which makes UTF-8 the by far
> best supported encoding for Vietnamese today.

The best supported (as in the one that works best without extra
settings), maybe. The most used, no.
To begin with, there are input methods for TCVN (and other 8-bit
encodings) but not for UTF-8.

> The Vietnamese locale
> currently installed on all Linux systems is UTF-8 only.

Well, I've provided both TCVN and VISCII for a while, first as a set
of files to run localedef on, then as part of Linux-Mandrake; I should
have sent the patches earlier (I just kept forgetting, as I was busy
with other things). Debian should also provide them, I think (at least
I had a mail exchange with a Debian developer and sent him my patches).

> TCVN-5712 is very dead on Linux and has already lost the race against UTF-8.

I don't think so. All the people I know who use Vietnamese on a
GNU/Linux system use either TCVN-5712 or, for some, VISCII.
UTF-8 simply wasn't working until very recently (and it still has some
problems; more problems than TCVN-5712. Only mail and HTTP are better
done in UTF-8, imho).
Ah, and Vietnamese LaTeX documents use the TCVN encoding too.

Mmh, look at the TSCII example: it is not in GNU/Linux distributions,
and probably not in the IANA registry either; however, it is used, and
there are even KDE and GNOME translations in that encoding (in fact,
all the Tamil po files I have seen are in TSCII). "Not included in a
standard" is very different from "not used".

> There is
> a little bit of use of VISCII though, so it deserves being mentioned,
> which I did.

TCVN-5712 should be added too then, for the same reason.

> > >  "WINDOWS-1251", "WINDOWS-1256"
> 
> > or "CP1251" and "CP1256" ? (that seems the preferred way and current
> > practise)
> 
> No. I find the idea very nice to stay *strictly* within MIME charset
> names. That simplifies interfacing to various Internet protocols (most
> notably MIME and HTTP) significantly.

A good point, indeed.
Mmmh, but the encoding name when creating a locale is taken from the
charmap given with localedef -f, isn't it? So the charmap file names
in GNU libc would need to be changed (or at least some symlinks added).
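For example (a hypothetical sketch, assuming glibc's localedef and its
usual charmap layout under /usr/share/i18n/charmaps; the names are
illustrative):

  # compile a locale whose charmap file carries the MIME name
  localedef -f WINDOWS-1251 -i ru_RU ru_RU.WINDOWS-1251
  # or keep the old file name and just add a link
  ln -s /usr/share/i18n/charmaps/CP1251.gz \
        /usr/share/i18n/charmaps/WINDOWS-1251.gz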

> Nobody gains anything from adding
> new names when there exists already a suitable managed namespace.

Is not a "new name". There isn't standardization until now (that is the
point of this proposal), so, anything that was earlier usage than the
first proposal can't be a "new name".
"CP1251" is not a new name, it's current usage.
Changing that current usage to gain the ability of use the encoding name
in MIME names for mail etc can be indeed a good idea.

> I prefer sound engineering arguments over vague ideas about what might
> or might not be preferred or current practise.

Well, names are just names; there is nothing really technical about
them, they are just a way to refer to the encoding. And if a particular
way to refer to it is very strongly rooted in use, then that is the
weightier argument imho (but that is not the case for cp1251 vs
windows-1251 afaik). Oh, btw, note that gettext (at least 0.10.37)
accepts cp1251 but complains about "windows-1251" (I had to make a
small patch to stop it from filling my screen with warnings I knew
were harmless every time I ran msgfmt --check).
Yes, a guideline on charset naming is indeed needed...
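To illustrate, here are the relevant po header lines (msgfmt takes the
charset from the Content-Type field; with gettext 0.10.37 the first
spelling is accepted silently while the second one triggers the
warnings mentioned above):

  "Content-Type: text/plain; charset=CP1251\n"
  "Content-Type: text/plain; charset=windows-1251\n"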

>   http://www.cl.cam.ac.uk/~mgk25/ucs/setcode                               

Funny:

  VSCII )
    # G1D6: TCVN 5712-1993 (VSCII) = ISO IR 180
    echo -ne '\033-Z' ;;

:)

> Roozbeh Pournader <[EMAIL PROTECTED]> wrote:
> > Arabic (language) people rarely use visual charsets.
> 
> So what exactly are Arabic users of POSIX systems using in file names,
> source code comments, email, web pages, etc.?

In logical order.
They need special software for that.
There exists, for example, a shared lib that overrides some X11 (and
gtk?) functions, allowing an Arabized GNOME desktop by setting
LD_PRELOAD. It is closed source.
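Something along these lines (the library name below is made up for
illustration, since the real thing is closed source):

  LD_PRELOAD=/usr/lib/libarabix.so gnome-session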
There are also Arabic editors and browsers, closed source too.
konqueror (from KDE, free software) displays Arabic pages in
iso-8859-6 and windows-1256 very nicely.

AFAIK only (old) Hebrew web pages implemented that silly idea of
visual encoding (which doesn't work very well anyway: change the
window width so the text rewraps, and it becomes unreadable).
Nowadays all Hebrew-enabled pages should be made for MS-IE, which has
RTL support.

> What are the current
> setups and how many people do you think use them? Is there any widely
> used Arabic encoding that should be supported in locales in addition to
> UTF-8?

ISO-8859-6 and CP1256, imho.
It is hard, however, to say how long they will last once free-software
Arabization becomes available; most of the current Arabic support
doesn't give much choice of encoding.

> What Linux terminal emulator and fonts are currently used for
> these Arabic locale definitions?

I don't know of any working xterm.
For fonts, a common encoding is "iso8859-6.8x":

http://www.langbox.com/arabic/fontara8X.html
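To look for such fonts (a hypothetical pattern; adjust it to the fonts
actually installed):

  xlsfonts -fn '*-iso8859-6.8x'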

> If there isn't a currently widely used Arabic terminal emulator (like
> kterm for the CJK community, which is very widely used), then the answer
> is probably that the Arabic script is not really widely used on Linux at
> the moment, and we can start supporting it from scratch properly with
> UTF-8 (see Robert Brady's work on Arabic xterm).

That's probable.

> Marco Cimarosti <[EMAIL PROTECTED]>:
> > In some cases, the only thing needed to implement them is the possibility
> > that glyphs  be "zero width", with the actual shape gutting on the right
> > side (or on the left side, in RTL scripts).
> 
> Such hacks work sometimes for simple text string display, but they fail
> on char cell terminal emulators, which get confused about the cursor
> position after a BDF zero-width combining character and explicitely have
> to know about the combining character. I'd advocate to worry about these
> mechanics only for UTF-8. There exists at present no widely deployed
> combining character terminal emulator support for any other encoding.

But people don't use only xterm; they use lots of other programs too.
In fact, people using xterms are also likely not to worry about the
xterm being in English and ASCII, but they may want their other
programs to be localized.

Char-cell terminal support is in fact much harder; maybe that should
be separated into a different level, so that a system could be Arabic-
or Thai-compliant for the GUI but not for xterms?

>> New encodings may be developed in future
> 
> At the moment, I believe everything points towards ISO 10646
> substituting all others with the next 10 years in >95% of all
> installations. Should this change, we can always reconsider and add new
> encodings to the small list.

I agree with you that there is no need to add new encodings; however,
I disagree with the list you provided: I think there should be 4 (only
4) more encodings, ones that are currently in use. The list would
still remain small but would be comprehensive.

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/            PGP Key available, key ID: 0x8F0E4975