Re: Perl Unicode support

Egmont Koblinger Mon, 02 Apr 2007 04:30:23 -0700

On Fri, Mar 30, 2007 at 02:04:14PM -0400, Rich Felker wrote:

Hi,


> As you’ll see, the only arguments with which you can portably call
> setlocale are NULL, "", "C", "POSIX", and perhaps also a string
> previously returned by setlocale.

You can portably _call_ setlocale() with any argument, as long as you check
its return value and properly handle if it failed to fulfill your request.
The arguments you listed are probably those for which you can always assume
setlocale() to succeed. In the other cases you still might give it a chance
and see whether it succeeds.
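
A minimal sketch of this "give it a chance and check the result" pattern.
The helper name set_first_available_locale is made up for illustration;
only setlocale() itself is a real library call:

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Try each candidate locale name in order. Returns the first name that
 * setlocale() accepted, or NULL if none of them could be set (in which
 * case the previous locale is still in effect). */
const char *set_first_available_locale(int category,
                                       const char *const *candidates,
                                       size_t count)
{
    assert(candidates != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        /* setlocale() returns NULL on failure and leaves the locale as is. */
        if (setlocale(category, candidates[i]) != NULL)
            return candidates[i];
    }
    return NULL;
}
```

A caller could pass, say, { "hu_HU.UTF-8", "hu_HU", "C" } and treat a NULL
return as "keep whatever was set before". Whether the first two names exist
depends on the system.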

> I’m interested only in portable applications, not “GNU/Linux
> applications”.

Our goals differ. Since I'm developing a Linux distro, I'm only interested
in developing GNU/Linux applications. We don't have the resources to check
the portability of our applications, nor do we want to make our job harder
by working with only a subset of the available functions and re-implementing
what's already implemented in glibc. I don't think new features get added to
glibc just to make it bigger. I think they are there for developers to use
when appropriate. They might not be appropriate for a portable application,
but they usually are appropriate for our goals.

> > the documentation of newlocale() and uselocale() and *_l() functions
> 
> These are nonstandard extensions and are a horrible mistake in design
> direction. Having the character encoding even be selectable at runtime
> is partly a mistake, and should be seen as a temporary measure during
> the adoption of UTF-8 to allow legacy apps to continue working until
> they can be fixed.

No, first of all, they are not about multiple encodings, but multiple
locales. (It seems to me that you are slightly mixing up locale and
encoding. The encoding is only one part of a locale and can be used
independently of it.) For example, if you create a German-French dictionary
application, it's to be expected that German strings are sorted according
to the German alphabet rules, while French words are sorted using the French
rules. Even if your operating system doesn't support these locales, it would
be reasonable for the application to try these locales and fall back to a
default sorting if they aren't available.
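
The try-then-fall-back idea from the dictionary example could look roughly
like the sketch below. It assumes a libc that provides newlocale() and
strcoll_l() (glibc does; they were later also standardized in POSIX.1-2008).
The function name sort_words_in_locale and locale names such as
"de_DE.UTF-8" are illustrative assumptions, not taken from any real program:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Locale used by the comparator; (locale_t)0 means "use plain strcmp()".
 * A file-scope variable keeps the sketch short but is not thread-safe. */
static locale_t sort_locale = (locale_t)0;

static int compare_words(const void *a, const void *b)
{
    const char *x = *(const char *const *)a;
    const char *y = *(const char *const *)b;
    return sort_locale != (locale_t)0 ? strcoll_l(x, y, sort_locale)
                                      : strcmp(x, y);
}

/* Sort words by the collation rules of locale_name. Returns 1 if that
 * locale was used, 0 if it was unavailable and byte order was used. */
int sort_words_in_locale(const char **words, size_t count,
                         const char *locale_name)
{
    assert(words != NULL && locale_name != NULL);

    /* newlocale() returns (locale_t)0 if the locale cannot be loaded. */
    locale_t loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t)0);

    sort_locale = loc;
    qsort(words, count, sizeof *words, compare_words);
    sort_locale = (locale_t)0;

    if (loc == (locale_t)0)
        return 0;
    freelocale(loc);
    return 1;
}
```

The German half of the dictionary would call this with a German locale name
and the French half with a French one; the global locale set by setlocale()
is never touched.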

> If you look into the GNU *_l() functions, the majority of them exist
> primarily or only because of LC_CTYPE.

It seems to me that the majority of them exist because of cultural
differences, and there would still be a need for them even if only UTF-8
existed: different time/date formats, different alphabetical sorting,
different lowercase-uppercase mapping etc.

> Applications which need to deal with multinational cultural
> expectations simultaneously probably need much stronger functionality
> than the standard library provides anyway, and would do best to use
> their own (possibly in library form) specialized machinery.

So far the functionality provided by glibc has been sufficient for me, and I
would have hated having to use an external library. ;)

Anyway, it would really be a bad decision if glibc didn't provide a way to
easily access locale data that originates from glibc and is already
accessible via glibc if you set the corresponding locale. The external
library you'd like to see would then either need to access the locale data
the same way glibc does, or have to provide the same information again in
its own form. Sounds terrible. An external library is a good approach if
some information cannot be obtained from glibc _at all_.

For example, glibc doesn't know how many people live in Hungary, it's not
part of the locale data. If you need it, you may pick up an external library
that tells you this.

However, glibc knows how to alphabetically sort Hungarian strings. You claim
that it shouldn't let applications access this piece of information, unless
they have their LANG/LC_* environment variables set to hu_HU or some variant
of it. You say that applications should find a different way (different
library, maybe different database) to access this data if they needed it
even if the system locale was not Hungarian. This is totally absurd.


> There is no standard on which character encodings should be
> supported (which is a good thing, since eventually they can all be
> dropped.. and even before then, non-CJK systems may wish to omit the
> large tables for legacy CJK encodings),

I don't think support for the current 8-bit encodings will die within the
next 50 years, and (as an application developer) if the underlying operating
system (its iconv() calls) doesn't support a particular encoding, I'd
happily blame it on the OS and not think about workarounds. In practice this
means that if I need to process data in a particular encoding, I pass this
encoding to iconv_open() and cry out loud if it fails. You're right, I don't
expect iconv() to support ISO-8859-1, but still, if I need it, I try it, use
it if available, and print an error message otherwise. I won't implement it
on my own; the application is not the right place to do that.
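
As a rough sketch of "pass the encoding to iconv_open() and cry out loud if
it fails": the helper name convert_or_complain and the wording of the error
messages are invented for illustration; iconv_open(), iconv() and
iconv_close() are the standard calls.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Convert in_len bytes at in from from_code to to_code, writing at most
 * out_cap bytes to out. Returns the number of bytes written, or
 * (size_t)-1 after printing a diagnostic if the encoding is unsupported
 * or the conversion fails. */
size_t convert_or_complain(const char *from_code, const char *to_code,
                           const char *in, size_t in_len,
                           char *out, size_t out_cap)
{
    assert(from_code != NULL && to_code != NULL && in != NULL && out != NULL);

    /* Note the argument order: target encoding first, source second. */
    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)-1) {
        fprintf(stderr, "cannot convert from %s to %s: %s\n",
                from_code, to_code, strerror(errno));
        return (size_t)-1;
    }

    char *src = (char *)in;   /* iconv() takes char **, but only reads src */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap;
    int failed = 0;

    /* Convert the input, then flush any pending shift sequence. */
    if (iconv(cd, &src, &src_left, &dst, &dst_left) == (size_t)-1 ||
        iconv(cd, NULL, NULL, &dst, &dst_left) == (size_t)-1) {
        fprintf(stderr, "conversion failed: %s\n", strerror(errno));
        failed = 1;
    }

    iconv_close(cd);
    return failed ? (size_t)-1 : out_cap - dst_left;
}
```

The application only reports the failure; it makes no attempt to supply its
own conversion tables.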


> > file to contain them in only one language, the one you want to see. Hence
> > this program outputs plenty of configuration file, one for each window
> > manager and each language (icewm.en, icewm.hu, windowmaker.en,
> > windowmaker.hu and so on).
> 
> It would be nice if these apps would use some sort of message catalogs
> for their menus, and if they would perform the sorting themselves at
> runtime.

Yes, that'd be a theoretically better solution, but it would require much,
much more work, would be less compatible with other distros, and would make
it much harder to adopt new window managers...

> You could use setlocale instead of the *_l() stuff so it would be
> portable to non-glibc.

If porting ever becomes an issue, I can still re-write it (with autoconf and
compile-time conditionals). Using the *_l() functions made the code cleaner
and probably faster.
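
A hedged sketch of what such a compile-time conditional might look like.
HAVE_STRCOLL_L is assumed to be defined by the build system (for example by
an autoconf function check) and compare_in_locale is a made-up name; this is
not the code of the program discussed here:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Compare a and b by the collation rules of locale_name, falling back to
 * strcmp() if that locale cannot be used. */
int compare_in_locale(const char *a, const char *b, const char *locale_name)
{
    assert(a != NULL && b != NULL && locale_name != NULL);
#ifdef HAVE_STRCOLL_L
    /* *_l() path: the global locale is left alone. */
    locale_t loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t)0);
    if (loc == (locale_t)0)
        return strcmp(a, b);
    int result = strcoll_l(a, b, loc);
    freelocale(loc);
    return result;
#else
    /* setlocale() path: switch the global locale, compare, switch back.
     * The current name is copied because later setlocale() calls may
     * overwrite the returned string. */
    const char *current = setlocale(LC_COLLATE, NULL);
    char *saved = current != NULL ? strdup(current) : NULL;
    int result;
    if (saved == NULL || setlocale(LC_COLLATE, locale_name) == NULL) {
        result = strcmp(a, b);
    } else {
        result = strcoll(a, b);
        setlocale(LC_COLLATE, saved);
    }
    free(saved);
    return result;
#endif
}
```

The setlocale() branch needs extra bookkeeping and changes process-wide
state on every call, which is one way to see why the *_l() version reads
cleaner.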

> For a normal user application I would say this
> is an abuse of locales to begin with and that it should use its own
> collation data tables,

Own table? Why? What's the gain in shipping duplicated data? How are we
supposed to create collation tables for all languages? Why do you think it's
wrong if glibc allows access to these data and I use them?

> but what you’re doing seems reasonable for a
> system-specific maintainence script. The code looks nice. Clean use of
> plain C without huge bloated frameworks.

Thanks :)


> > I can't see any problem here. Can you? Browsers work correctly, don't they?
> > You ask me how I'd implement a feature that _is_ implemented in basically
> > any browser. I guess your browser handles frames and tabs with different
> > charset correctly, doesn't it? Even if you run it with an 8-bit locale.
> 
> I meant you run into trouble if you were going to change locale for
> each page. Obviously it works if you don’t use the locale system.

Well, of course I didn't mean changing the _locale_ either, just converting
between _encodings_.


> Links works the other way: converting everything to the selected
> character encoding. Crappy versions of links (including the popular
> gui one) only support 8bit codepages, but recent ELinks supports
> UTF-8.

I know the mainstream version of links is crappy. I haven't checked elinks
yet, but I will soon. Does it have a GUI version? In a terminal, as I've
said, it's okay if it converts everything to the locale's encoding, since in
a terminal it's not possible to display characters outside the default
locale's charset. (Except for the \e%G magic...) If it _is_ possible for an
application to display characters outside the default locale's charset, IMO
it _has_ to do so.


> Also applications that want to interact with other applications on the
> system expecting to receive text, e.g. an external text editor or
> similar.

They can convert the data back to the locale's encoding before passing it
to the external application. That's no excuse for not displaying the
characters if it's otherwise technically possible.


-- 
Egmont

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
