On Fri, Mar 30, 2007 at 11:46:12AM -0400, Rich Felker wrote:

> What does “supports the encoding” mean? Applications cannot select the
> locale they run in, aside from requesting the “C” or “POSIX” locale.

This isn't so. First of all, see the setlocale(3) manual page, as well
as the documentation of newlocale(), uselocale() and the *_l()
functions (there's no man page for them, use Google). These show you
how to switch to an arbitrary existing locale, no matter what your
environment variables are.
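
For example, something along these lines switches just the calling
thread to a Hungarian locale, ignoring LANG and the LC_* variables
entirely (only a sketch; it assumes the hu_HU.UTF-8 locale is
generated on the system):

#define _GNU_SOURCE
#include <locale.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    /* Build a Hungarian locale object, independently of the
       environment (assumes hu_HU.UTF-8 exists on the system). */
    locale_t hu = newlocale(LC_ALL_MASK, "hu_HU.UTF-8", (locale_t) 0);
    if (hu == (locale_t) 0) {
        perror("newlocale");
        return 1;
    }

    /* Switch this thread to it, no matter what the environment says. */
    locale_t old = uselocale(hu);

    char buf[64];
    time_t t = time(NULL);
    strftime(buf, sizeof buf, "%A", localtime(&t));
    printf("day of the week, spelled in Hungarian: %s\n", buf);

    uselocale(old);   /* restore the previous locale */
    freelocale(hu);   /* release the locale object   */
    return 0;
}

The *_l() variants (strcoll_l(), strftime_l() and so on) take such a
locale_t argument directly, so you don't even have to switch the
thread's locale if you only need a single call.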

Second, in order to perform charset conversion you don't need locales
at all; you only need the iconv_open(3) and iconv(3) library calls.
Yes, glibc provides the means to convert between two arbitrary
character sets, even if the locale in effect uses a third, different
charset.
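
For example (only a sketch; the two charset names and the input string
are arbitrary, and error handling is kept minimal):

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Convert from ISO-8859-2 to KOI8-R, regardless of what charset
       the current locale uses (it may well be a third one). */
    iconv_t cd = iconv_open("KOI8-R", "ISO-8859-2");
    if (cd == (iconv_t) -1) {
        perror("iconv_open");
        return 1;
    }

    char in[] = "some input text in ISO-8859-2";
    char out[256];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
        perror("iconv");
    printf("converted to %zu bytes\n", sizeof out - outleft);

    iconv_close(cd);
    return 0;
}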

> It’s the decision of the user and/or the system implementor. In fact
> it would be impossible to switch locales when visiting different pages
> anyway.

No, it's not impossible, and actually it isn't even needed.

Just for curiosity's sake: I wrote a menu generator for our
distribution. It loads the application menu from the desktop files
under /usr/share/applications and outputs menu files for various window
managers, such as IceWM, Window Maker, Enlightenment and so on. The
input .desktop files contain the names of the applications in multiple
languages. Simple window managers expect the menu file to contain them
in only one language, the one you want to see. Hence this program
outputs plenty of configuration files, one for each window manager and
each language (icewm.en, icewm.hu, windowmaker.en, windowmaker.hu and
so on).

The entries are sorted alphabetically. But the rules of alphabetical
sorting differ from language to language, hence I have to use many
locales. Before dumping icewm.en, I have to switch to an English locale
and perform the sorting there. Before dumping icewm.hu, I need to
activate the Hungarian locale. And so on.
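
In C the idea looks roughly like this (just a sketch, not the actual
uhu-menu code; it assumes the en_US.UTF-8 and hu_HU.UTF-8 locales are
generated on the system):

#define _GNU_SOURCE
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Locale used by the comparison below; set before each sort. */
static locale_t sort_locale;

static int cmp(const void *a, const void *b)
{
    return strcoll_l(*(const char *const *) a,
                     *(const char *const *) b, sort_locale);
}

int main(void)
{
    /* Hungarian treats "cs" as a letter of its own, so the two
       locales may well order these entries differently. */
    const char *entries[] = { "cukor", "csak", "citrom" };
    size_t n = sizeof entries / sizeof entries[0];
    const char *locales[] = { "en_US.UTF-8", "hu_HU.UTF-8" };
    size_t i, j;

    for (i = 0; i < 2; i++) {
        sort_locale = newlocale(LC_COLLATE_MASK, locales[i], (locale_t) 0);
        if (sort_locale == (locale_t) 0) {
            perror("newlocale");
            return 1;
        }
        qsort(entries, n, sizeof entries[0], cmp);
        printf("%s:", locales[i]);
        for (j = 0; j < n; j++)
            printf(" %s", entries[j]);
        printf("\n");
        freelocale(sort_locale);
    }
    return 0;
}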

Earlier versions of this program even included UTF-8 -> 8-bit
conversions (.desktop files use UTF-8, while our distro still used
old-fashioned 8-bit locales in those early days), and this 8-bit
charset again differed from language to language. So for example, when
dumping icewm.fr I converted the French descriptions to Latin-1, but
when dumping icewm.hu they had to be converted to Latin-2. In newer
versions this part of the code has been dropped, since luckily UTF-8 is
now used in the generated files as well.

Just in case you're interested, here's the source:
ftp://ftp.uhulinux.hu/sources/uhu-menu/

> How would you deal with multiple browser windows or tabs, or even frames?

I can't see any problem here. Can you? Browsers work correctly, don't
they? You're asking me how I'd implement a feature that _is_
implemented in basically every browser. I guess your browser handles
frames and tabs with different charsets correctly, doesn't it? Even if
you run it with an 8-bit locale.

One possible way is to convert each separate input stream (e.g. HTML
page or frame) from its own encoding to a common internal
representation (most likely UTF-8). Technically there are some minor
issues that make this more complicated (e.g. the charset info can be
inside the HTML file itself), but theoretically there's absolutely no
problem.
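
A rough sketch of that approach with iconv (the ISO-8859-2 source
charset and the literal "page" are made up for the example; real code
would detect the charset and read the stream incrementally):

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The source charset would come from the HTTP header or from the
       <meta> tag inside the page; it is hard-coded here. */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-2");
    if (cd == (iconv_t) -1) {
        perror("iconv_open");
        return 1;
    }

    char page[] = "bytes of one page or frame, in its own charset";
    char *inp = page;
    size_t inleft = strlen(page);

    /* Convert in a loop so an arbitrarily long stream fits through a
       small output buffer; a real implementation would also keep an
       incomplete multibyte sequence at the end of a read (EINVAL) and
       prepend it to the next chunk. */
    while (inleft > 0) {
        char out[64], *outp = out;
        size_t outleft = sizeof out;

        size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
        fwrite(out, 1, sizeof out - outleft, stdout);  /* UTF-8 so far */
        if (r == (size_t) -1 && errno != E2BIG) {
            perror("iconv");
            break;
        }
    }
    putchar('\n');
    iconv_close(cd);
    return 0;
}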

> Normal implementations work either by converting all data to the
> user’s encoding, or by converting it all to some representation of
> Unicode (UTF-8 or UTF-32, or something nonstandard like UTF-21).

Normal implementations work the 2nd way, that is, they use a
Unicode-compatible internal encoding. From the user's point of view
there's only one difference between the two ways: with the 1st way,
characters not present in your current locale are lost; with the 2nd
way, they are kept and displayed correctly. Hence I still can't see any
reason for choosing the 1st way (except for terminal applications that
have to stick to the terminal's charset).



-- 
Egmont

