In 'perlunicode' under 'when Unicode does not happen' there is the statement
(regarding Unicode functionality and the file system-related functions and
operators - BTW the author missed '-X' ):

Oops.


"One reason why Perl does not attempt to resolve the role of Unicode in these
cases is that the answers are highly dependent on the operating system and
the file system(s)."


This statement seems a bit evasive - especially since there are no other
reasons listed. What are the other reasons? Is there a plan to make things
easier in this area? Or not? If not, why?

I don't understand what you find evasive. What "other reasons"? There are none:
the statement is vague because the answer has to be -- it depends highly on the
operating system and on the file system(s).

This statement also seems like an exaggeration. On Unix-like systems, it is
obvious how to deal with the file system - you convert Unicode to multibyte.

*Which* multibyte? There are multiple different encodings just for Unicode.
Perl cannot know which one is being used. Just by virtue of "being
in UNIX" the process cannot start playing Unicode games with the filenames--
it must know it is in the right directory / filesystem before doing that.
And _other applications_ must know about that-- otherwise (say) "foo" will
look like "\0f\0o\0o".
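The "\0f\0o\0o" point can be seen directly with the core Encode module -- the
same three characters come out as entirely different byte sequences depending
on which Unicode encoding you (or the filesystem) pick (a minimal sketch):

```perl
use strict;
use warnings;
use Encode qw(encode);

# The same three characters under two different Unicode encodings:
my $name  = "foo";
my $utf8  = encode('UTF-8',    $name);   # the bytes "foo"
my $utf16 = encode('UTF-16BE', $name);   # the bytes "\0f\0o\0o"
printf "UTF-8: %d bytes, UTF-16BE: %d bytes\n",
    length($utf8), length($utf16);
# prints "UTF-8: 3 bytes, UTF-16BE: 6 bytes"
```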


(AFAIK) W2K and later _are able_ to use UTF-16LE encoded Unicode for filenames,
but because of backward compatibility reasons using 8-bit codepages is much
more likely.


The Apple HFS+ handles Unicode using _normalized_ (a decomposed form close to NFD) UTF-8.

There we have two different Unicode encodings, both in use.
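And the normalization question is on top of the encoding question: two
filenames that render identically on screen can be different codepoint
sequences, so a filesystem must pick one form and enforce it. The core
Unicode::Normalize module shows the difference (a sketch):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $eacute = "\x{E9}";   # LATIN SMALL LETTER E WITH ACUTE, precomposed
# NFC keeps the single precomposed character; NFD splits it into
# "e" plus a combining acute accent.
printf "NFC: %d char(s), NFD: %d char(s)\n",
    length(NFC($eacute)), length(NFD($eacute));
# prints "NFC: 1 char(s), NFD: 2 char(s)"
```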

On Windows, there is a choice when dealing with the file system - multibyte
or Unicode - but the new "-C" switch seems to cover that choice.

How so? The *old* -C switch (as in 5.6) did attempt to cover the Unicode
filename support of Windows, but Gurusamy Sarathy deemed the support broken
(one aspect of the brokenness being that it was a global switch) and unused
enough that the -C switch was recycled for completely different semantics
that have nothing to do with filenames, on Windows or anywhere else.


On other
systems - who knows (true!) but isn't that a porting issue? Those 'other'
people wouldn't be harmed by making things easier for the rest of us, right?

Any solutions will have to be OS-dependent, quite possibly application-dependent,
and I very much think they do not belong in the core language.


Dealing with qx and 'system()' also seems less than mysterious to me -
there's no 'wfork' - as far as I know - so you use multibyte.

How do you know what kinds of strings are sent to the system? How do you know what kinds of strings are returned from the system?
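Perl cannot answer that question for you; the best a script can do is decode
the returned bytes explicitly, under a stated assumption about the encoding.
A sketch, assuming the system hands back UTF-8 (the "café" bytes here stand in
for whatever qx// or readdir() might return):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes as qx// or readdir() might hand them back -- here, UTF-8 for "café".
# That the system talks UTF-8 is *our* assumption, not something Perl can know.
my $raw   = "caf\xC3\xA9";
my $chars = decode('UTF-8', $raw);   # now a 4-character string
printf "%d bytes in, %d characters out\n", length($raw), length($chars);
# prints "5 bytes in, 4 characters out"
```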

Don't get me wrong - I love the Unicode support in Perl - it is an amazing
effort. But dealing with the file system (and to a lesser extent qx/system)
seems like a big hole to me

It is a big hole but I think Perl cannot portably do much to fill it. Perl cannot know what your filesystem can handle.

An example. Say I have a Shift-JIS string and I
want to do a mkdir (on a Shift-JIS-enabled OS with 5.8.1 build 807):

"a Shift-JIS enabled OS"? I have no idea what do you mean by that. OSes are somewhat unlikely do assume a character set since that's rather more an application level issue.

$newdir = "kanji_here_\x89\x5C";
mkdir $newdir;

The above works the way I'd expect, although

print (-d $newdir ? 'yes' : 'no');

prints 'no' - oops a character handling bug! The second byte of the kanji is
a backslash, which confuses Perl, apparently. "-d" really ought to assume
the user knows what he is doing

I tend to misbelieve that :-) All "-d" is doing is passing the $newdir (UTF-8) bytes to stat(2).

and do character-handling based on the
current file system encoding setting (LC_CTYPE or the equivalent).

There is no portable "current file system encoding setting" API.
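On POSIX systems the closest approximation is the locale's CODESET, via the
core I18N::Langinfo module -- but note that this describes the *locale*, not
any particular mounted filesystem, and I18N::Langinfo does not exist on every
platform. A hedged sketch (the kanji is illustrative; unmappable characters
are replaced by Encode's default substitution):

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);
use Encode qw(encode);

setlocale(LC_CTYPE, "");              # adopt the user's locale settings
my $codeset = langinfo(CODESET);      # e.g. "UTF-8", "SJIS", "ANSI_X3.4-1968"
my $newdir  = "kanji_here_\x{627F}";  # a *character* string
mkdir encode($codeset, $newdir) or warn "mkdir: $!";
```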


It seems counter-intuitive that this fails:

use encoding 'shiftjis';
$newdir = "kanji_here_\x89\x5C";
mkdir $newdir;

Whoops - I just created a directory whose name is UTF-8 bytes (which don't
assemble into valid Japanese characters under Shift-JIS). I don't think that's
what most users would expect - and 'mkdir' could do better than that.

How did you expect Perl to know that your filesystem expects and accepts shiftjis?
What if you do a chdir() to a filesystem that does not?
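One way to make the intent explicit -- rather than hoping `use encoding`
reaches into mkdir -- is to do the conversion yourself. A sketch; whether the
filesystem actually wants Shift-JIS bytes is still an assumption *you* have
to make:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $kanji  = decode('shiftjis', "\x89\x5C");  # the kanji from the example, as a character
my $newdir = "kanji_here_" . $kanji;          # a Perl *character* string
mkdir encode('shiftjis', $newdir)             # you choose the on-disk bytes explicitly
    or warn "mkdir: $!";
```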


Anyway, I don't mean to criticize all the wonderful work that has been done.
This is more a question about future direction and also a request to update
the documentation - if this kind of thing isn't going to be fixed soon it
would be nice to add some sample code showing how to write a proper
Unicode-ized Perl script that deals with the file system properly.

Perl 5.8 has all the bits and pieces required to do whatever you want
with filenames, but it cannot know when to do which conversions. In some
cases, on some OSes, you can just convert the characters into whatever bytes
you want and push them out as-is (say, a directory name into UTF-8), and the
system will happily do just that. In other cases you would need to call
a different set of system calls (like on Windows).


Again, I think the right way to do what you want is to create a set
of (operating system dependent) modules (some may require XS) that introduce
the necessary filesystem-related (mkdir etc) variants (or overrides, if one
wants those).
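Such a module could be quite small. A sketch only -- the module name
Filesys::SJIS, the function name, and the hardwired encoding are all
illustrative assumptions, not an existing API:

```perl
package Filesys::SJIS;   # hypothetical module name
use strict;
use warnings;
use Encode qw(encode);

# A mkdir variant that takes a *character* string and explicitly
# chooses the on-disk byte encoding (here fixed to Shift-JIS).
sub mkdir_sjis {
    my ($dir) = @_;
    return CORE::mkdir(encode('shiftjis', $dir));
}

1;
```

A real module family would do this per-OS (and on Windows likely via XS and
the wide-character system calls), as suggested above.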


--
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen




