Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]> writes:

>Should strings without the UTF8 flag be interpreted in the default
>encoding of the current locale or in ISO-8859-1?
This is a tricky question, and the status quo is likely to remain for
compatibility reasons.

>Perl treats them inconsistently. On one hand they are read from files
>and used as filenames without any recoding, which implies that they are
>assumed to be in some unspecified default encoding.

Actually perl makes no such assumption - this is just historical
"it just works" code which is compatible with perls before 5.6.

>On the other hand
>they are upgraded to UTF-8 as if they were ISO-8859-1.

This is possibly dubious practice, but it is what happened in 5.6, which
had Unicode but no Encode module. That situation lasted long enough that
there is a code base which relies on it. In perl 5.8 you can use explicit
Encode, or an :encoding layer, or 'use encoding', or ... to get what you
want.

>Perl is inconsistent whether "\xE0" or chr(0xE0) means the character
>0xE0 in the default encoding or U+00E0:
>
>perl -e '
>$x = "foo\xE0";
>$y = substr($x . chr(300), 0, 4);
>print $x eq $y, "\n";
>open F1, ">$x";
>open F2, ">$y"'
>
>The strings are equal, yet two filenames are created. I consider this
>behavior broken.

FWIW so do I, but consensus has not been reached on the right fix.
I would also like to see something akin to 'use locale' (which would
treat 0xE0 according to the locale's CTYPE), as distinct from the
current scheme, which treats 0x80..0xFF according to Unicode
(== Latin-1 by definition) semantics.

>IMHO it would be more logical to assume that strings without the UTF-8
>flag are in some default encoding, probably taken from the locale.
>Upgrading them to UTF-8 should take it into account instead of blindly
>assuming ISO-8859-1,

It would be more logical, but it would break things.

>and using an UTF-8 string as a filename should
>convert it to this encoding back (on Unix)

This is the tricky bit. The theory goes that the way forward on Unix is
(probably) UTF-8 filenames, and that in a UTF-8 locale with 'use utf8'
this works already. (There are a few rough edges...)
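The explicit Encode route mentioned above looks roughly like this - a
sketch only, using ISO-8859-2 as a stand-in for whatever the locale's
encoding actually is:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One octet, no UTF8 flag.  Upgrading this directly would blindly
# treat it as ISO-8859-1, i.e. U+00E0.
my $bytes = "\xE0";

# Decoding through Encode takes the source encoding into account:
# in ISO-8859-2, octet 0xE0 is U+0155 (LATIN SMALL LETTER R WITH ACUTE).
my $chars = decode('iso-8859-2', $bytes);
printf "U+%04X\n", ord($chars);    # U+0155

# encode() converts character data back to octets, e.g. for a
# filename on a non-UTF-8 system.
my $back = encode('iso-8859-2', $chars);
printf "0x%02X\n", ord($back);     # 0xE0
```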
For older Unixes which don't do UTF-8 there is the issue of how you
discover what the current locale's encoding is - if they are old enough
not to have UTF-8 locales, they probably lack the API to get the
encoding as well :-( But I agree that getting two different file names
is bad.

>or use UTF-16 API (on
>Windows).

The snag here is that when you say "Windows" you mean WinNT and later;
Win9X (and WinME?) can't do that. For Win9x you have to convert to the
current "code page" - akin to the Unix case.

>This leaves chr() ambiguous,

It isn't ambiguous: it is always (ignoring EBCDIC platforms for now)
Unicode/Latin-1 - which can be represented in one of two ways, UTF-8 or
single octets. The representation is supposed to be invisible to perl
code, but in the case of file names it isn't.

>so there should be some other function for
>making Unicode code points, as chr should probably be kept for
>compatibility to mean the default encoding.
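To make the "two representations, one string" point concrete, here is a
small sketch; utf8::upgrade forces the UTF-8 internal form without
changing the string's characters:

```perl
use strict;
use warnings;

my $x = "\xE0";        # single octet, no UTF8 flag
my $y = "\xE0";
utf8::upgrade($y);     # same characters, UTF-8 internal representation

# The internal flag differs...
print utf8::is_utf8($x) ? "flagged" : "bytes", "\n";   # bytes
print utf8::is_utf8($y) ? "flagged" : "bytes", "\n";   # flagged

# ...but to perl code the strings are identical, as the representation
# is supposed to be invisible.
print $x eq $y ? "equal" : "different", "\n";          # equal

# The one-liner quoted earlier is where this invisibility leaks:
# passed to open(), the two representations can name two different
# files on a non-UTF-8 filesystem.
```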
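On Unixes that do have the API, the locale's encoding can be discovered
along these lines; I18N::Langinfo (shipped with perl 5.8) wraps
nl_langinfo(), which is exactly the API the really old systems tend to
lack. The printed value is of course environment-dependent:

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt the user's locale settings, then ask what encoding they imply.
setlocale(LC_CTYPE, "");
my $codeset = langinfo(CODESET);
print "locale encoding: $codeset\n";   # e.g. "UTF-8" or "ISO-8859-1"
```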