Re: perl unicode support

Rich Felker Thu, 29 Mar 2007 08:53:59 -0800

On Thu, Mar 29, 2007 at 12:01:28PM +0200, Egmont Koblinger wrote:
> On Wed, Mar 28, 2007 at 02:35:32PM -0400, Rich Felker wrote:
> 
> > > matches or not _does_ depend on the character set that you use. It's not
> > > perl's flaw that it couldn't decide, it's impossible to decide in theory
> > > unless you know the charset.
> > 
> > It is perl's flaw. The LC_CTYPE category of the locale determines the
> > charset. This is how all sane languages work.
> 
> LC_CTYPE determines the system charset. This is used when reading from /
> writing to a terminal, to/from text files by default; this is the charset
> you expect messages coming from glibc to be encoded in; etc...
> 
> But this is not necessarily the charset you want your application to work
> with. Think of Gtk+-2 for example, internally it always uses UTF-8, no
> matter what your locale is.


Gtk+-2’s approach is horribly incorrect and broken. By default it
writes UTF-8 filenames into the filesystem even if UTF-8 is not the
user’s encoding. 

> So it _has_ to tell every external regexp
> routine (if it uses any) to work with UTF-8, not with the charset implied by
> LC_CTYPE.

This is their fault for designing it wrong. If they correctly used the
requested encoding, there would be no problem.
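
For reference, asking the system which encoding was requested takes only
a few lines. This is a minimal sketch in Python rather than anything
Gtk+ does; locale_charset is a hypothetical helper name, and the
fallback branch for platforms without nl_langinfo is an assumption.

```python
import locale


def locale_charset():
    """Return the charset name implied by the LC_CTYPE locale category."""
    try:
        # Adopt the user's environment (LC_ALL, LC_CTYPE, LANG) instead of "C".
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # Unknown or uninstalled locale: keep whatever locale is already active.
        pass
    if hasattr(locale, "nl_langinfo"):
        # POSIX systems: ask the C library directly, as a C program would.
        name = locale.nl_langinfo(locale.CODESET)
    else:
        # Assumption: platforms without nl_langinfo still report a preference.
        name = locale.getpreferredencoding(False)
    return name or "ascii"


if __name__ == "__main__":
    print(locale_charset())
```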

> And you can think of any web browser, mail client and so on, they have to
> cope with the charset that particular web page or message uses, yet again
> independently from the system locale.

Not independently. All they have to do is convert it to the local
encoding. And yes I’m quite aware that a lot of information might be
lost in the process. That’s fine. If users want to be able to read
multilingual text, they NEED to migrate to a character encoding that
supports multilingual text. Trying to “work around” this [non-]issue
by mixing encodings and failing to respect LC_CTYPE is a huge hassle
for negative gain.
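
A minimal sketch of that conversion in Python. convert_to_local is a
hypothetical helper name, not something taken from any real mail client
or browser, and the local encoding is passed in explicitly here rather
than read from the locale.

```python
def convert_to_local(raw, source_charset, local_encoding):
    """Re-encode external bytes into the user's local encoding (lossy)."""
    # Decode using the charset the message or page declares for itself.
    text = raw.decode(source_charset, errors="replace")
    # Re-encode for the locale; characters it cannot represent become "?".
    return text.encode(local_encoding, errors="replace")


if __name__ == "__main__":
    utf8_body = "na\u00efve \u20ac5".encode("utf-8")
    # Latin-1 has the i-diaeresis but no euro sign, so the euro is lost.
    print(convert_to_local(utf8_body, "utf-8", "latin-1"))
    # A UTF-8 locale loses nothing.
    print(convert_to_local(utf8_body, "utf-8", "utf-8"))
```

In a Latin-1 locale the euro sign is replaced; in a UTF-8 locale the
bytes come through unchanged.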

> > I don't have to be aware of it in any other language. It just works.
> 
> Show me your code that you think "just works" and I'll show you where you're
> wrong. :-)

Mutt is an excellent example.

> > Perl is being unnecessarily difficult here.
> 
> You forget one very important thing: Compatibility. In the old days Perl
> used 8-bit strings, and many people created many perl programs that
> handled 8-bit (most likely iso-8859-1) data. These programs must continue to
> work correctly with newer Perls. This implies that perl mustn't assume UTF-8
> charset for the data flows (even if your locale says so) since in this case
> it would produce different output.

Such programs could just as easily be run in a legacy locale, if
available on the system. But unless the data they’re processing
actually contains Latin-1 (in which case you’re in a Latin-1
environment!), there’s no reason that treating the strings as UTF-8
should cause any harm. ASCII is the same either way of course. The
only possible exception is if a perl program is using regex on true
binary data, which is a bit dubious to begin with.
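
The ASCII point is easy to check. Below is a small Python sketch, not
perl, and decodes_identically is a made-up name. It reports whether a
byte string means the same text under both interpretations.

```python
def decodes_identically(data):
    """True if the bytes mean the same text as UTF-8 and as Latin-1."""
    try:
        as_utf8 = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 at all, e.g. a lone Latin-1 high byte.
        return False
    # Latin-1 maps every byte to a character, so this decode cannot fail.
    return as_utf8 == data.decode("latin-1")


if __name__ == "__main__":
    print(decodes_identically(b"hello"))         # pure ASCII
    print(decodes_identically(b"caf\xe9"))       # Latin-1 e-acute
    print(decodes_identically(b"caf\xc3\xa9"))   # UTF-8 e-acute
```

Only the pure ASCII input gives the same text both ways. The two
non-ASCII inputs differ, which is exactly the case where the data
belongs to one particular encoding.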

> > Nonsense. As long as all the length variables are in the SAME unit,
> > your program has absolutely no reason to care whatsoever exactly what
> > that unit is. Any unit is just as good as long as it's consistent.
> 
> If you don't know what unit is used, then you're unable to answer questions
> whether that man is most likely healthy, whether he's extremely tall or
> extremely small.

Thresholds and formulae for what height counts as tall, small, healthy,
or whatever just need to be written in whichever unit you’ve selected
as the global unit.

> If you don't know what unit is used, how do you fill up your structures from
> external data source? What if you are supposed to store cm but the data
> arrives in inches? How would you know that you need to convert?

Same way it works with character encodings. The code importing
external data knows what format the internal data must be in. The
internal code has no knowledge or care what the unit/encoding is. This
keeps the internal code clean and simple.
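
A rough sketch of that split in Python. All names here (import_height,
import_text, is_tall, the 190.0 threshold) are invented for
illustration. Only the import functions know about external formats;
the internal code just uses the one agreed unit and one string type.

```python
CM_PER_INCH = 2.54

# Written once in the internal unit, like every other length in the program.
TALL_THRESHOLD = 190.0


def import_height(value, unit):
    """Boundary code: convert an external measurement to the internal unit."""
    if unit == "cm":
        return float(value)
    if unit == "in":
        return float(value) * CM_PER_INCH
    raise ValueError("unknown unit: %r" % (unit,))


def import_text(raw, declared_charset):
    """Boundary code: convert external bytes to the internal string type."""
    return raw.decode(declared_charset)


def is_tall(height):
    """Internal code: compares two internal lengths, never names a unit."""
    return height > TALL_THRESHOLD


if __name__ == "__main__":
    print(is_tall(import_height(75, "in")))
    print(is_tall(import_height(170, "cm")))
    print(import_text(b"caf\xe9", "latin-1") == import_text(b"caf\xc3\xa9", "utf-8"))
```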

> I guess you've heard several stories about million (billion?) dollar
> projects failing due to such stupid mistakes - one developer sending the
> data in centimeters, the other expecting them to arrive in inches.

Yes.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
