Marcin 'Qrczak' Kowalczyk <[EMAIL PROTECTED]> writes:

>Should strings without the UTF8 flag be interpreted in the default
>encoding of the current locale or in ISO-8859-1?
This is a tricky question, and the status quo is likely to remain for
compatibility reasons.

>Perl treats them inconsistently. On one hand they are read from files
>and used as filenames without any recoding, which implies that they are
>assumed to be in some unspecified default encoding.

Actually perl makes no such assumption - this is just historical
"it just works" code which is compatible with perls before 5.6.

>On the other hand
>they are upgraded to UTF-8 as if they were ISO-8859-1.

This is possibly dubious practice, but it is what happened in 5.6, which
had Unicode but no Encode module. That situation lasted long enough that
there is a code base which relies on it. In perl 5.8 you can use explicit
Encode, or an :encoding layer, or 'use encoding', or ... to get what you
want.

>Perl is inconsistent whether "\xE0" or chr(0xE0) means the character
>0xE0 in the default encoding or U+00E0:
>
>perl -e '
>$x = "foo\xE0";
>$y = substr($x . chr(300), 0, 4);
>print $x eq $y, "\n";
>open F1, ">$x";
>open F2, ">$y"'
>
>The strings are equal, yet two filenames are created. I consider this
>behavior broken.

FWIW so do I, but consensus has not been reached on the right fix.
I would also like to see something akin to 'use locale' (which would
treat 0xE0 according to the locale's CTYPE), as distinct from the
current scheme, which treats 0x80..0xFF according to Unicode
(== Latin-1 by definition) semantics.

>IMHO it would be more logical to assume that strings without the UTF-8
>flag are in some default encoding, probably taken from the locale.
>Upgrading them to UTF-8 should take it into account instead of blindly
>assuming ISO-8859-1,

It would be more logical, but it would break things.

>and using an UTF-8 string as a filename should
>convert it to this encoding back (on Unix)

This is the tricky bit. The theory goes that the way forward on Unix is
(probably) UTF-8 filenames, and that in a UTF-8 locale with 'use utf8'
this works already. (There are a few rough edges...)
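The explicit Encode route mentioned above looks roughly like this - a
sketch only, using ISO-8859-2 as a stand-in for whatever the locale's
encoding actually is:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One octet, no UTF8 flag.  Upgrading this directly would blindly
# treat it as ISO-8859-1, i.e. U+00E0.
my $bytes = "\xE0";

# Decoding through Encode takes the source encoding into account:
# in ISO-8859-2, octet 0xE0 is U+0155 (LATIN SMALL LETTER R WITH ACUTE).
my $chars = decode('iso-8859-2', $bytes);
printf "U+%04X\n", ord($chars);    # U+0155

# encode() converts character data back to octets, e.g. for a
# filename on a non-UTF-8 system.
my $back = encode('iso-8859-2', $chars);
printf "0x%02X\n", ord($back);     # 0xE0
```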
For older Unixes which don't do UTF-8 there is the issue of how you
discover what the current locale's encoding is - if they are old enough
not to have UTF-8 locales, they probably lack the API to get the
encoding as well :-( But I agree that getting two different file names
is bad.

>or use UTF-16 API (on
>Windows).

The snag here is that when you say "Windows" you mean WinNT and later;
Win9X (and WinME?) can't do that. For Win9x you have to convert to the
current "code page" - akin to the Unix case.

>This leaves chr() ambiguous,

It isn't ambiguous: it is always (ignoring EBCDIC platforms for now)
Unicode/Latin-1 - which can be represented in one of two ways, UTF-8 or
single octets. The representation is supposed to be invisible to perl
code, but in the case of file names it isn't.

>so there should be some other function for
>making Unicode code points, as chr should probably be kept for
>compatibility to mean the default encoding.
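To make the "two representations, one string" point concrete, here is a
small sketch; utf8::upgrade forces the UTF-8 internal form without
changing the string's characters:

```perl
use strict;
use warnings;

my $x = "\xE0";        # single octet, no UTF8 flag
my $y = "\xE0";
utf8::upgrade($y);     # same characters, UTF-8 internal representation

# The internal flag differs...
print utf8::is_utf8($x) ? "flagged" : "bytes", "\n";   # bytes
print utf8::is_utf8($y) ? "flagged" : "bytes", "\n";   # flagged

# ...but to perl code the strings are identical, as the representation
# is supposed to be invisible.
print $x eq $y ? "equal" : "different", "\n";          # equal

# The one-liner quoted earlier is where this invisibility leaks:
# passed to open(), the two representations can name two different
# files on a non-UTF-8 filesystem.
```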
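On Unixes that do have the API, the locale's encoding can be discovered
along these lines; I18N::Langinfo (shipped with perl 5.8) wraps
nl_langinfo(), which is exactly the API the really old systems tend to
lack. The printed value is of course environment-dependent:

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt the user's locale settings, then ask what encoding they imply.
setlocale(LC_CTYPE, "");
my $codeset = langinfo(CODESET);
print "locale encoding: $codeset\n";   # e.g. "UTF-8" or "ISO-8859-1"
```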