On Mon, 22 Sep 2014 17:58:41 -0400, Sven Van Caekenberghe <s...@stfx.eu> wrote:

I also find the way some problems are reported quite disturbing. How much testing did you do ? On which platforms ?

I can do this (in Pharo 3) without any problems (we're talking about arbitrary Unicode characters in path names):

('/tmp' asFileReference / 'été') ensureCreateDirectory.
'/tmp/été' asFileReference exists.
('/tmp/été' asFileReference / 'Ελλάδα.txt') writeStreamDo: [ :out |
  out << 'What about Greece ?' ].
('/tmp/été' asFileReference / 'Ελλάδα.txt') exists.
('/tmp/été' asFileReference / 'Ελλάδα.txt') contents.

And in a terminal, I get:

$ ls /tmp/été/Ελλάδα.txt
/tmp/été/Ελλάδα.txt

$ cat !$
cat /tmp/été/Ελλάδα.txt
What about Greece ?

This is on Mac OS X.

So this part fundamentally works in the image and on one VM. There might of course be problems in how paths are used in certain places or on certain VM/platforms.


Focusing purely on Unicode itself (not the encoding systems), a letter like é can be represented as U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or as U+0065 (LATIN SMALL LETTER E) followed by U+0301 (combining acute accent). These will appear identical to the user, but are emphatically *not* identical for most software. The way you're testing here, you will not hit any error relating to this concept, ever, because you're using Pharo for both generating and consuming the strings. At the very least, we'd need to generate a file named "été" with both forms explicitly and see what happens.

Things get even more exciting, though, because Unix says that file names are simply arbitrary byte patterns that do not contain the null byte.* Thus, you can trivially create a file named "été" using Latin-1 encoding, and again using UTF-8 encoding, and again using UTF-7 encoding, and these might all be shown to the user as "identically" named, but I guarantee you that Pharo will not act sanely with all four of these. Even on Windows, where things are a bit saner (NTFS mandates UTF-16), and where an explicit normalization form is preferred (NFC), I just explicitly verified that I can trivially inject other normalization forms into the file system. Thus, you can still have two files named "été" that nevertheless have different names as far as the OS is concerned.

In this case, as far as I can tell, Pharo assumes that all path names are Unicode, and does not do any work to convert strings to or from the various normalization schemes (looking in Path class>>canonicalizeElements:, Path class>>from:delimiter, and FileSystemStore>>pathFromString: here).

There's therefore a pretty straightforward fix that Pharo could do:

  1. Path would use ByteArrays as the actual canonical store, and
     provide convenience methods to see what the array decodes to
     in various encodings.  The developer and application can make
     decisions about what encoding system they want to use.
  2. The VM likely needs to be modified to handle this (didn't check)

As much as I wish Hilaire provided more details in his bug report, it's worth keeping in mind that not all users, or even all programmers, understand the full implications of things like how various Unicode normalization and encoding schemes interact in practice with Unix's very vague concept of what a file name actually is, so I usually try to approach these bug reports carefully and with an open mind.

--Benjamin

* On OS X, HFS+ uses UTF-16 with an Apple-specific variant of NFD, whereas I do not believe this holds for e.g. UFS or FUSE-backed file systems, so things are a bit subtler there, but the general rule holds.

Reply via email to