Re: Interpretation of non-UTF8 strings

Marcin 'Qrczak' Kowalczyk Tue, 24 Aug 2004 06:43:29 -0700

On Sun, 22-08-2004 at 23:14 +0300, Jarkko Hietaniemi wrote:


> > - file contents, including stdin/stdout/stderr and sockets,
> >   unless overridden explicitly
> > - filenames (including functions like mkdir, stat, glob)
> > - arguments of system and exec
> > - @ARGV
> > - %ENV
> > - $! when it contains the result of strerror()
> > - and probably other similar things I've forgotten.
> 
> Yes, there are still many holes in our "Unicode armor", as is documented
> in the perlunicode.  Where we more or less stopped for 5.8 was where
> the unportabilities between operating systems became evident and harder
> to resolve.  I know you care only for your platform but Perl has to
> consider all the platforms: currently at least Linux, Win32, and Mac OS
> X have "serious" Unicode platform support.  Being portable over the
> filesystems, CLIs, and APIs is no fun.

At least we agree that there is work to do in this area.

Portability is not a sufficient excuse though. There are bugs, like
that with double recoding, or with $ARGV[0] not being equivalent to
substr($ARGV[0], 0).
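
For readers who have not met the term, here is a minimal sketch of what
double recoding looks like in general, assuming the Encode module. It is
not a reproduction of the specific bug, and the sample character is
chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+0105 (a with ogonek) as a character string.
my $text = "\x{105}";

# Correct: encode once, when the text leaves the program.
my $once = encode('UTF-8', $text);

# Double recoding: the UTF-8 bytes are mistaken for ISO-8859-1
# characters and then encoded to UTF-8 a second time.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Render a byte string as space-separated hex for display.
sub hex_bytes { return join ' ', map { sprintf '%02X', ord } split //, $_[0] }

print "once:  ", hex_bytes($once),  "\n";   # C4 85
print "twice: ", hex_bytes($twice), "\n";   # C3 84 C2 85
```

The second line is what ends up on the terminal or in the file when a
string is encoded both by the program and by an I/O layer.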

The API is, I'm afraid, not good enough, even if we ignore the old mode
of manipulating data in its external encoding. Namely, it doesn't
distinguish specifying the encoding of the script source (which depends
on where it has been written) from specifying the encoding that the
script should assume on STDIN/STDOUT/STDERR and other places (which
depends on where it is being run). The other places would be covered only
once that is implemented, and assuming it will indeed be triggered by the
'encoding' pragma.

I hope the -C flag is considered a temporary hack, to be eventually
replaced with something that supports other encodings and not only
UTF-8.

I guess my language bridge should use only the open pragma, then call
binmode on STDIN/STDOUT/STDERR, and remember that everything besides
file contents is encoded in ISO-8859-1. The encoding pragma is buggy,
its impact might change in the future, and I really think it should be
split into two: one for the encoding of the script source and one for the
default encoding of external world channels which don't specify an
encoding explicitly. The second half of the split should include an
interface to the open pragma. Rather than only a single central switch,
it could perhaps offer separate options for separate places, something
similar to

use encoding files => "ISO-8859-2";
use encoding terminal => "UTF-8";
use encoding filenames => "ISO-8859-1";
use encoding env => "locale";
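
The syntax above is only a proposal. A rough sketch of how the same
separation can be approximated by hand today, assuming the Encode module
and PerlIO layers: the %channel table, slurp_text and decode_env_value
are hypothetical names, and "UTF-8" for the environment is an assumed
stand-in for "locale".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical per-channel table mirroring the proposal above.
my %channel = (
    files    => 'ISO-8859-2',
    terminal => 'UTF-8',
    env      => 'UTF-8',     # assumption: stand-in for "locale"
);

# Terminal: set the layer on the standard handles explicitly.
binmode($_, ":encoding($channel{terminal})") for \*STDIN, \*STDOUT, \*STDERR;

# Files: pass the layer to open(). $source may be a path or, as in the
# demonstration below, a reference to an in-memory byte string.
sub slurp_text {
    my ($source) = @_;
    open my $fh, "<:encoding($channel{files})", $source
        or die "cannot open: $!";
    local $/;                # read the whole file at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

# @ARGV and %ENV: nothing decodes these automatically, so do it by hand.
sub decode_env_value { return decode($channel{env}, $_[0]) }

my @args = map { decode_env_value($_) } @ARGV;
my %env  = map { ($_ => decode_env_value($ENV{$_})) } keys %ENV;

my $bytes = "\xB1\n";        # a with ogonek in ISO-8859-2
my $text  = slurp_text(\$bytes);
printf "read U+%04X from the file\n", ord $text;
```

Filenames are left out of the sketch because they would need the reverse
step, encoding before each call such as mkdir or stat.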

We should also consider what to do on recoding errors (fail?) and how to
specify alternative behaviors. On problems with decoding an email, it's
better to show garbage while warning the user that the encoding is wrong
than to die and not show anything at all.
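
One way to get the "show garbage but warn" behavior with what exists
now is the CHECK argument of Encode's decode. A minimal sketch, where
decode_leniently is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try a strict decode first and fall back to a
# lossy one. Returns the decoded text and a flag saying whether the
# input was clean.
sub decode_leniently {
    my ($encoding, $bytes) = @_;

    # FB_CROAK makes decode() die on malformed input. LEAVE_SRC keeps
    # $bytes intact so that it can be decoded again below.
    my $text = eval {
        decode($encoding, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ($text, 1) if defined $text;

    warn "input is not valid $encoding; showing a lossy decoding\n";

    # With no CHECK argument, malformed bytes become U+FFFD.
    return (decode($encoding, $bytes), 0);
}

my ($text, $clean) = decode_leniently('UTF-8', "abc\xFF");
print $clean ? "clean\n" : "lossy\n";
```

A mail reader could use the flag to mark the message as wrongly labeled
while still displaying it.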

We should think about how this interacts with the locale-aware behavior of
functions. Without 'use locale' and other pragmas it's clear: Perl
consistently assumes that every text is ISO-8859-1. When something like
'use encoding' is in effect, Perl still interprets the scalars in the
same way, but treats them differently when they interact with the world.

But with 'use locale' it assumes that non-UTF-8 scalars are in the
current locale encoding, which is incompatible with the assumptions
made when UTF-8 scalars and non-UTF-8 scalars are mixed. So the two will
probably never work together. If 'use locale' includes some essential
features besides the treatment of texts, like date/time formatting,
they should be available by other means, without at the same time causing
ord(lc(chr(161))) to be equal to 177, which doesn't make sense if
character codes are interpreted according to Unicode. It implies
that when localized texts are taken from the system, they must be
decoded from the locale encoding.
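
A small sketch of the alternative, assuming the locale encoding is
ISO-8859-2 (hard-coded here for illustration) and using the Encode
module: the text is decoded first, lowercased as Unicode, and encoded
again only on the way out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Without 'use locale', code 161 is not a cased letter, so lc leaves
# it alone.
my $raw = ord(lc(chr(161)));

# Byte 161 in ISO-8859-2 is A with ogonek. Decode it to U+0104 and
# lowercase it according to Unicode, which gives U+0105.
my $char  = decode('ISO-8859-2', chr(161));
my $lower = lc($char);

# Encoding back to ISO-8859-2 yields byte 177, but only as external
# data, never as a character code inside the program.
my $byte = ord(encode('ISO-8859-2', $lower));

printf "raw: %d, decoded: U+%04X, lowercased: U+%04X, re-encoded: %d\n",
    $raw, ord($char), ord($lower), $byte;
```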

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
