In 'perlunicode' under 'when Unicode does not happen' there is the statement
(regarding Unicode functionality and the file system-related functions and
operators - BTW the author missed '-X' ):

Oops.


"One reason why Perl does not attempt to resolve the role of Unicode in these
cases is that the answers are highly dependent on the operating system and
the file system(s)."


This statement seems a bit evasive - especially since there are no other
reasons listed. What are the other reasons? Is there a plan to make things
easier in this area? Or not? If not, why?

I don't understand what you find evasive. What "other reasons"? There are none:
the statement is vague because the answer has to be -- it depends highly on the
operating system and on the file system(s).

This statement also seems like an exaggeration. On Unix-like systems, it is
obvious how to deal with the file system - you convert Unicode to multibyte.

*Which* multibyte? There are multiple different encodings just for Unicode.
Perl cannot know which one is being used. Just by virtue of "being
in UNIX" the process cannot start playing Unicode games with the filenames--
it must know it is in the right directory / filesystem before doing that.
And _other applications_ must know about that-- otherwise (say) "foo" will
look like "\0f\0o\0o".
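The "\0f\0o\0o" point can be seen directly with the core Encode module -- the
same three characters come out as entirely different byte sequences depending
on which Unicode encoding you (or the filesystem) pick (a minimal sketch):

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same three characters under two different Unicode encodings:
my $name  = "foo";
my $utf8  = encode('UTF-8',    $name);   # the bytes "foo"
my $utf16 = encode('UTF-16BE', $name);   # the bytes "\0f\0o\0o"
printf "UTF-8: %d bytes, UTF-16BE: %d bytes\n",
    length($utf8), length($utf16);
# prints "UTF-8: 3 bytes, UTF-16BE: 6 bytes"
```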


(AFAIK) W2K and later _are able_ to use UTF-16LE encoded Unicode for filenames,
but because of backward compatibility reasons using 8-bit codepages is much
more likely.


The Apple HFS+ handles Unicode using _normalized_ (a decomposed form close to NFD) UTF-8.

There we have two different Unicode encodings, both in use.
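And the normalization question is on top of the encoding question: two
filenames that render identically on screen can be different codepoint
sequences, so a filesystem must pick one form and enforce it. The core
Unicode::Normalize module shows the difference (a sketch):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $eacute = "\x{E9}";   # LATIN SMALL LETTER E WITH ACUTE, precomposed
# NFC keeps the single precomposed character; NFD splits it into
# "e" plus a combining acute accent.
printf "NFC: %d char(s), NFD: %d char(s)\n",
    length(NFC($eacute)), length(NFD($eacute));
# prints "NFC: 1 char(s), NFD: 2 char(s)"
```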

On Windows, there is a choice when dealing with the file system - multibyte
or Unicode - but the new "-C" switch seems to cover that choice.

How so? The *old* -C switch (as in 5.6) did attempt to cover the Unicode
filename support of Windows, but Gurusamy Sarathy deemed the support broken
(one aspect of the brokenness being that it was a global switch) and unused
enough that the -C switch was recycled for completely different semantics
that have nothing to do with filenames, on Windows or anywhere else.


On other
systems - who knows (true!) but isn't that a porting issue? Those 'other'
people wouldn't be harmed by making things easier for the rest of us, right?

Any solutions will have to be OS-dependent, quite possibly application-dependent,
and I very much think they do not belong in the core language.


Dealing with qx and 'system()' also seems less than mysterious to me -
there's no 'wfork' - as far as I know - so you use multibyte.

How do you know what kinds of strings are sent to the system? How do you know what kinds of strings are returned from the system?
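Perl cannot answer that question for you; the best a script can do is decode
the returned bytes explicitly, under a stated assumption about the encoding.
A sketch, assuming the system hands back UTF-8 (the "café" bytes here stand in
for whatever qx// or readdir() might return):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as qx// or readdir() might hand them back -- here, UTF-8 for "café".
# That the system talks UTF-8 is *our* assumption, not something Perl can know.
my $raw   = "caf\xC3\xA9";
my $chars = decode('UTF-8', $raw);   # now a 4-character string
printf "%d bytes in, %d characters out\n", length($raw), length($chars);
# prints "5 bytes in, 4 characters out"
```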

Don't get me wrong - I love the Unicode support in Perl - it is an amazing
effort. But dealing with the file system (and to a lesser extent qx/system)
seems like a big hole to me

It is a big hole but I think Perl cannot portably do much to fill it. Perl cannot know what your filesystem can handle.

An example. Say I have a Shift-JIS string and I
want to do a mkdir (on a Shift-JIS-enabled OS with 5.8.1 build 807):

"a Shift-JIS enabled OS"? I have no idea what do you mean by that. OSes are somewhat unlikely do assume a character set since that's rather more an application level issue.

$newdir = "kanji_here_\x89\x5C";
mkdir $newdir;

The above works the way I'd expect, although

print (-d $newdir ? 'yes' : 'no');

prints 'no' - oops a character handling bug! The second byte of the kanji is
a backslash, which confuses Perl, apparently. "-d" really ought to assume
the user knows what he is doing

I tend to misbelieve that :-) All "-d" is doing is passing the $newdir (UTF-8) bytes to stat(2).

and do character-handling based on the
current file system encoding setting (LC_CTYPE or the equivalent).

There is no portable "current file system encoding setting" API.
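On POSIX systems the closest approximation is the locale's CODESET, via the
core I18N::Langinfo module -- but note that this describes the *locale*, not
any particular mounted filesystem, and I18N::Langinfo does not exist on every
platform. A hedged sketch (the kanji is illustrative; unmappable characters
are replaced by Encode's default substitution):

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(encode);

setlocale(LC_CTYPE, "");              # adopt the user's locale settings
my $codeset = langinfo(CODESET);      # e.g. "UTF-8", "SJIS", "ANSI_X3.4-1968"
my $newdir  = "kanji_here_\x{627F}";  # a *character* string
mkdir encode($codeset, $newdir) or warn "mkdir: $!";
```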


It seems counter-intuitive that this fails:

use encoding 'shiftjis';
$newdir = "kanji_here_\x89\x5C";
mkdir $newdir;

Whoops - I just created a directory whose name is UTF-8 bytes (which don't
assemble into valid Japanese characters under Shift-JIS). I don't think that's
what most users would expect - and 'mkdir' could do better than that.

How did you expect Perl to know that your filesystem expects and accepts shiftjis?
What if you do a chdir() to a filesystem that does not?
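One way to make the intent explicit -- rather than hoping `use encoding`
reaches into mkdir -- is to do the conversion yourself. A sketch; whether the
filesystem actually wants Shift-JIS bytes is still an assumption *you* have
to make:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $kanji  = decode('shiftjis', "\x89\x5C");  # the kanji from the example, as a character
my $newdir = "kanji_here_" . $kanji;          # a Perl *character* string
mkdir encode('shiftjis', $newdir)             # you choose the on-disk bytes explicitly
    or warn "mkdir: $!";
```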


Anyway, I don't mean to criticize all the wonderful work that has been done.
This is more a question about future direction and also a request to update
the documentation - if this kind of thing isn't going to be fixed soon it
would be nice to add some sample code showing how to write a proper
Unicode-ized Perl script that deals with the file system properly.

Perl 5.8 has all the bits and pieces required to do whatever you want
with filenames, but it cannot know when to do which conversions. In some
cases, on some OSes, you can just convert the characters into whatever bytes
you want and push them out as-is (say, a directory name into UTF-8), and the
system will happily do just that. In other cases you would need to call
a different set of system calls (like on Windows).


Again, I think the right way to do what you want is to create a set
of (operating system dependent) modules (some may require XS) that introduce
the necessary filesystem-related (mkdir etc) variants (or overrides, if one
wants those).
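Such a module could be quite small. A sketch only -- the module name
Filesys::SJIS, the function name, and the hardwired encoding are all
illustrative assumptions, not an existing API:

```perl
package Filesys::SJIS;   # hypothetical module name
use strict;
use warnings;
use Encode qw(encode);

# A mkdir variant that takes a *character* string and explicitly
# chooses the on-disk byte encoding (here fixed to Shift-JIS).
sub mkdir_sjis {
    my ($dir) = @_;
    return CORE::mkdir(encode('shiftjis', $dir));
}

1;
```

A real module family would do this per-OS (and on Windows likely via XS and
the wide-character system calls), as suggested above.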


--
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen




