On Fri, Mar 30, 2007 at 07:06:52PM +0200, Egmont Koblinger wrote:
> On Fri, Mar 30, 2007 at 11:46:12AM -0400, Rich Felker wrote:
> 
> > What does “supports the encoding” mean? Applications cannot select the
> > locale they run in, aside from requesting the “C” or “POSIX” locale.
> 
> This isn't so. First of all, see the manual page setlocale(3), as well as

The documentation of setlocale is here:
http://www.opengroup.org/onlinepubs/009695399/functions/setlocale.html

As you’ll see, the only arguments with which you can portably call
setlocale are NULL, "", "C", "POSIX", and perhaps also a string
previously returned by setlocale.

I’m interested only in portable applications, not “GNU/Linux
applications”.

> the documentation of newlocale() and uselocale() and *_l() functions (no man
> page for them, use google). These will show you how to switch to arbitrary
> existing locale, no matter what your environment variables are.

These are nonstandard extensions and are a horrible mistake in design
direction. Having the character encoding even be selectable at runtime
is partly a mistake, and should be seen as a temporary measure during
the adoption of UTF-8 to allow legacy apps to continue working until
they can be fixed. In the future we should have much lighter, sleeker,
more maintainable systems without runtime-selectable character
encoding.

If you look into the GNU *_l() functions, the majority of them exist
primarily or only because of LC_CTYPE. The madness of locally
bindable locales would be less mad if all of those could be thrown
out, keeping only the ones that actually depend on cultural customs
rather than on character encoding.

However, I suspect even then it’s a mistake. Applications which just
need to present data to the user in a form that’s comfortable to the
user’s cultural expectations are fine with a single global locale.
Applications which need to deal with multinational cultural
expectations simultaneously probably need much stronger functionality
than the standard library provides anyway, and would do best to use
their own (possibly in library form) specialized machinery.

> Second, in order to perform charset conversion, you don't need locales at
> all, you only need the iconv_open(3) and iconv(3) library calls. Yes, glibc
> provides a function to convert between two arbitrary character sets, even if
> the locale in effect uses a third, different charset.

Yes, I’m well aware. This is not specific to glibc but part of the
standard. There is no standard on which character encodings should be
supported (which is a good thing, since eventually they can all be
dropped, and even before then, non-CJK systems may wish to omit the
large tables for legacy CJK encodings), nor on the names for the
encodings (which is rather stupid; it would be very reasonable and
practical for SUS to mandate that, if an encoding is supported, it
must be supported under its standard preferred MIME name). The
standard also does not necessarily guarantee a direct conversion from
A to C, even if conversions from A to B and B to C exist.

> file to contain them in only one language, the one you want to see. Hence
> this program outputs plenty of configuration file, one for each window
> manager and each language (icewm.en, icewm.hu, windowmaker.en,
> windowmaker.hu and so on).

It would be nice if these apps would use some sort of message catalogs
for their menus, and if they would perform the sorting themselves at
runtime.

> Just in case you're interested, here's the source:
> ftp://ftp.uhulinux.hu/sources/uhu-menu/

You could use setlocale instead of the *_l() stuff so it would be
portable to non-glibc. For a normal user application I would say this
is an abuse of locales to begin with and that it should use its own
collation data tables, but what you’re doing seems reasonable for a
system-specific maintenance script. The code looks nice. Clean use of
plain C without huge bloated frameworks.

> > How would you deal with multiple browser windows or tabs, or even frames?
> 
> I can't see any problem here. Can you? Browsers work correctly, don't they?
> You ask me how I'd implement a feature that _is_ implemented in basically
> any browser. I guess your browser handles frames and tabs with different
> charset correctly, doesn't it? Even if you run it with an 8-bit locale.

I meant you run into trouble if you were going to change locale for
each page. Obviously it works if you don’t use the locale system.

> > Normal implementations work either by converting all data to the
> > user’s encoding, or by converting it all to some representation of
> > Unicode (UTF-8 or UTF-32, or something nonstandard like UTF-21).
> 
> Normal implementations work the 2nd way, that is, use a Unicode-compatible
> internal encoding.

Links works the other way: converting everything to the selected
character encoding. Crappy versions of links (including the popular
GUI one) only support 8-bit codepages, but recent ELinks supports
UTF-8.

> From the user's point of view there's only one difference
> between the two ways. Using the 1st way characters not present in your
> current locale are lost. Using the 2nd way they are kept and displayed
> correctly. Hence I still can't see any reason for choosing the 1st way
> (except for terminal applications that have to stick to the terminal
> charset).

Also applications that want to interact with other applications on the
system expecting to receive text, e.g. an external text editor or
similar.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
