In a message of Sun, 22-08-2004, 23:14 +0300, Jarkko Hietaniemi wrote:
> > - file contents, including stdin/stdout/stderr and sockets,
> >   unless overridden explicitly
> > - filenames (including functions like mkdir, stat, glob)
> > - arguments of system and exec
> > - @ARGV
> > - %ENV
> > - $! when it contains the result of strerror()
> > - and probably other similar things I've forgotten.
>
> Yes, there are still many holes in our "Unicode armor", as is documented
> in the perlunicode. Where we more or less stopped for 5.8 was where
> the unportabilities between operating systems became evident and harder
> to resolve. I know you care only for your platform but Perl has to
> consider all the platforms: currently at least Linux, Win32, and Mac OS
> X have "serious" Unicode platform support. Being portable over the
> filesystems, CLIs, and APIs is no fun.

At least we agree that there is work to do in this area.

Portability is not a sufficient excuse though. There are bugs, like the
one with double recoding, or with $ARGV[0] not being equivalent to
substr($ARGV[0], 0).

The API is, I'm afraid, not good enough, even if we ignore the old mode
of manipulating data in its external encoding. Namely, it doesn't
distinguish specifying the encoding of the script source (which depends
on where it has been written) from specifying the encoding that the
script should assume on STDIN/STDOUT/STDERR and other places (which
depends on where it is being run). Well, other places once they are
implemented, assuming they will indeed be triggered by the 'encoding'
pragma.

I hope the -C flag is considered a temporary hack, to be eventually
replaced with something which supports other encodings and not only
UTF-8.

I guess my language bridge should use only the open pragma, then call
binmode on STDIN/STDOUT/STDERR, and remember that everything besides
file contents is encoded in ISO-8859-1.
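A minimal sketch of that workaround, under assumptions of my own: UTF-8 stands in for whatever the external encoding really is, and decode_external is a hypothetical helper name, not part of any module. It shows the open pragma setting default layers, explicit binmode calls for the already-open standard handles, and explicit decoding of a byte string that arrived from a place Perl does not recode (such as @ARGV or %ENV).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Default I/O layers for filehandles opened in this lexical scope.
# UTF-8 is an assumed choice; the message does not name an encoding.
use open IO => ':encoding(UTF-8)';

# The standard handles are already open, so set their layers explicitly.
binmode($_, ':encoding(UTF-8)') for \*STDIN, \*STDOUT, \*STDERR;

# Hypothetical helper: turn bytes received from the outside world
# (e.g. an element of @ARGV or %ENV) into a character string.
sub decode_external {
    my ($bytes, $encoding) = @_;
    return decode($encoding, $bytes);
}

# The two bytes that encode U+0142 in UTF-8. Left undecoded, Perl
# treats them as two separate ISO-8859-1 characters.
my $raw  = "\xC5\x82";
my $text = decode_external($raw, 'UTF-8');
```

Here length($raw) is 2 while length($text) is 1, which is the practical meaning of "everything besides file contents is encoded in ISO-8859-1": anything not decoded by hand is seen byte by byte.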
The encoding pragma is buggy, its impact might change in the future, and
I really think it should be split into two: one for the encoding of the
script source and one for the default encoding of external world
channels which don't specify an encoding explicitly.

The second half of the split should include an interface to the open
pragma. If not with a single central switch alone, then perhaps with
separate options for separate places, something similar to

   use encoding files => "ISO-8859-2";
   use encoding terminal => "UTF-8";
   use encoding filenames => "ISO-8859-1";
   use encoding env => "locale";

We should also consider what to do on recoding errors (fail?) and how to
specify alternative behaviors. When there are problems decoding an
email, it's better to show garbage while warning the user that the
encoding is wrong than to die and not show anything at all.

We should think about how this interacts with the locale-aware behavior
of functions. Without 'use locale' and other pragmas it's clear: Perl
consistently assumes that every text is ISO-8859-1. When something like
'use encoding' is in effect, Perl still interprets the scalars in the
same way, but treats them differently when they interact with the world.
But with 'use locale' it assumes that non-UTF-8 scalars are in the
current locale encoding, which is incompatible with the assumptions made
when UTF-8 scalars and non-UTF-8 scalars are mixed. So the two will
probably never work together.

If 'use locale' includes some essential features besides the treatment
of texts, like date/time formatting, they should be available by other
means, without at the same time causing ord(lc(chr(161))) to be equal to
177, which doesn't make sense if character codes are interpreted
according to Unicode. It implies that when localized texts are taken
from the system, they must be decoded from the locale encoding.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/