Re: perl unicode support

Egmont Koblinger Thu, 29 Mar 2007 02:02:09 -0800

On Wed, Mar 28, 2007 at 02:35:32PM -0400, Rich Felker wrote:

> > matches or not _does_ depend on the character set that you use. It's not
> > perl's flaw that it couldn't decide, it's impossible to decide in theory
> > unless you know the charset.
> 
> It is perl's flaw. The LC_CTYPE category of the locale determines the
> charset. This is how all sane languages work.


LC_CTYPE determines the system charset. This is the default used when
reading from or writing to a terminal or text files; this is the charset
you expect messages coming from glibc to be encoded in; etc...

But this is not necessarily the charset you want your application to work
with. Think of Gtk+-2, for example: internally it always uses UTF-8, no
matter what your locale is. So it _has_ to tell every external regexp
routine (if it uses any) to work with UTF-8, not with the charset implied by
LC_CTYPE.

Think also of any web browser, mail client and so on: they have to cope
with the charset that a particular web page or message uses, yet again
independently of the system locale.

So, to stay with our example of a fictional regexp matching library: If this
library insists on assuming that the strings are encoded according to
LC_CTYPE then it's quite hard to use it correctly in such circumstances.
(You might need to write a wrapper that alters the locale temporarily -- but
could you tell me how to find a locale whose charset is one particular
charset?) If the charset the regexp library expects _defaults_ to LC_CTYPE
but is overridable then it's much better. And for libraries such as
glib2/gtk2 which force using UTF-8 internally it's of course perfectly okay
if they implement a UTF-8-only regexp matching function.
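
To make the "defaults to LC_CTYPE but is overridable" idea concrete, here is
a minimal sketch in Python. The function name charset_match and its charset
parameter are hypothetical, invented for this illustration; they do not
belong to any real regexp library. The locale's charset is taken from
locale.getpreferredencoding(), as Python reports it.

```python
import locale
import re


def charset_match(pattern, data, charset=None):
    """Return True if the whole byte string matches pattern once decoded.

    Hypothetical helper: charset defaults to the locale's charset,
    but the caller may override it.
    """
    if charset is None:
        # Default: the charset implied by the current locale (LC_CTYPE).
        charset = locale.getpreferredencoding(False)
    text = data.decode(charset)
    return re.fullmatch(pattern, text) is not None


# The two bytes 0xC3 0xA9 are one letter (e with acute) in UTF-8,
# but two separate characters in ISO-8859-1.
e_acute_utf8 = "\u00e9".encode("utf-8")

# One word character when the caller says the data is UTF-8 ...
print(charset_match(r"\w", e_acute_utf8, charset="utf-8"))
# ... but not when the same bytes are declared to be ISO-8859-1.
print(charset_match(r"\w", e_acute_utf8, charset="iso-8859-1"))
```

An application like a mail client would pass the charset declared by the
message, and only fall back to the locale default when it knows nothing
better.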


> I don't have to be aware of it in any other language. It just works.

Show me your code that you think "just works" and I'll show you where you're
wrong. :-)

> Perl is being unnecessarily difficult here.

You forget one very important thing: Compatibility. In the old days Perl
used 8-bit strings, and many people wrote perl programs that handled 8-bit
(most likely iso-8859-1) data. These programs must continue to work
correctly with newer Perls. This implies that perl mustn't assume UTF-8
charset for the data flows (even if your locale says so) since in this case
it would produce different output.
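
The effect is easy to reproduce outside Perl. The following is only a sketch
in Python, with a hypothetical legacy_upcase function standing in for an old
8-bit program that uppercases its input; it is not how Perl itself is
implemented. It shows that the same bytes give different output, or no
output at all, once the assumed charset changes.

```python
def legacy_upcase(data, charset):
    """Uppercase a byte string, interpreting it in the given charset."""
    return data.decode(charset).upper().encode(charset)


# Bytes that happen to be valid in both charsets: 'c', 'a', 'f', 0xC3, 0xA9.
sample = b"caf\xc3\xa9"

# Treated as ISO-8859-1, the last two bytes are two characters
# that upper() leaves unchanged.
as_latin1 = legacy_upcase(sample, "iso-8859-1")

# Treated as UTF-8, they are one letter, and its uppercase form
# has a different byte sequence.
as_utf8 = legacy_upcase(sample, "utf-8")

print(as_latin1 == as_utf8)  # the two interpretations disagree

# Genuine ISO-8859-1 data is often not valid UTF-8 at all.
try:
    legacy_upcase(b"caf\xe9", "utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```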


> Nonsense. As long as all the length variables are in the SAME unit,
> your program has absolutely no reason to care whatsoever exactly what
> that unit is. Any unit is just as good as long as it's consistent.

If you don't know what unit is used, then you're unable to answer questions
such as whether that man is most likely healthy, or whether he's extremely
tall or extremely short.

If you don't know what unit is used, how do you fill up your structures from
an external data source? What if you are supposed to store cm but the data
arrives in inches? How would you know that you need to convert?

What if multiple external data sources use different units? If you ignore
the whole problem you'll end up with different units in your database where
even adding two numbers doesn't make any sense - just as it doesn't make any
sense to simply concatenate two byte sequences that represent text in
different encodings.
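
A small Python sketch of the concatenation problem follows. The sample words
and the try_decode helper are made up for the illustration. Gluing the raw
bytes together gives a sequence that is wrong under either charset, while
decoding each piece with its own charset first gives the intended text.

```python
def try_decode(data, charset):
    """Decode data, or return None if it is not valid in that charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


# Two pieces of text arriving from sources that use different charsets.
latin1_part = "caf\u00e9".encode("iso-8859-1")
utf8_part = " na\u00efve".encode("utf-8")

# Naive concatenation of the raw bytes.
mixed = latin1_part + utf8_part

# As UTF-8 the result is invalid; as ISO-8859-1 the second word is mangled.
as_utf8 = try_decode(mixed, "utf-8")
as_latin1 = try_decode(mixed, "iso-8859-1")

# Converting each piece to a common representation first works.
correct = latin1_part.decode("iso-8859-1") + utf8_part.decode("utf-8")

print(as_utf8)
print(as_latin1 == correct)
```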

I guess you've heard several stories about million (billion?) dollar
projects failing due to such stupid mistakes - one developer sending the
data in centimeters, the other expecting them to arrive in inches.


-- 
Egmont

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
