Re: [Pharo-users] Ridiculous we are

Benjamin Pollack Wed, 24 Sep 2014 09:49:14 -0700

On Mon, 22 Sep 2014 17:58:41 -0400, Sven Van Caekenberghe <s...@stfx.eu> wrote:

I also find the way some problems are reported quite disturbing. How much testing did you do ? On which platforms ?
I can do this (in Pharo 3) without any problems (we're talking about arbitrary Unicode characters in path names):
('/tmp' asFileReference / 'été') ensureCreateDirectory.
'/tmp/été' asFileReference exists.
('/tmp/été' asFileReference / 'Ελλάδα.txt') writeStreamDo: [ :out |
  out << 'What about Greece ?' ].
('/tmp/été' asFileReference / 'Ελλάδα.txt') exists.
('/tmp/été' asFileReference / 'Ελλάδα.txt') contents.

And in a terminal, I get:

$ ls /tmp/été/Ελλάδα.txt
/tmp/été/Ελλάδα.txt

$ cat !$
cat /tmp/été/Ελλάδα.txt
What about Greece ?

This is on Mac OS X.
So this part fundamentally works in the image and on one VM. There might of course be problems in how paths are used in certain places or on certain VM/platforms.

Focusing purely on Unicode itself (not the encoding systems), a letter like é can be represented as U+00E9 (LATIN SMALL LETTER E WITH ACUTE), or as U+0065 (LATIN SMALL LETTER E) followed by U+0301 (COMBINING ACUTE ACCENT). These will appear identical to the user, but are emphatically *not* identical for most software. The way you're testing here, you will not hit any error relating to this concept, ever, because you're using Pharo for both generating and consuming the strings. At the very least, we'd need to generate a file named "été" with both forms explicitly and see what happens.
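
A minimal sketch of that experiment from the shell side, assuming a POSIX shell and a scratch directory from mktemp (the variable names are illustrative only). It spells out both forms of "été" as explicit UTF-8 bytes, so nothing can normalize them on the way in, and then counts how many directory entries the file system actually kept:

```shell
#!/bin/sh
# "été" built from explicit UTF-8 bytes, written as octal escapes.
nfc=$(printf '\303\251t\303\251')      # U+00E9 t U+00E9 (precomposed)
nfd=$(printf 'e\314\201te\314\201')    # e U+0301 t e U+0301 (decomposed)

dir=$(mktemp -d)                       # scratch directory (assumption)
touch "$dir/$nfc"
touch "$dir/$nfd"

# 2 entries if the file system keeps both spellings apart,
# 1 if it normalizes names so that both map to the same file.
count=$(ls "$dir" | wc -l | tr -d ' ')
echo "entries: $count"
rm -rf "$dir"
```

Pointing Pharo at a directory prepared this way, instead of letting Pharo create the files itself, is what would exercise the case described above.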

Things get even more exciting, though, because Unix says that file names are simply arbitrary byte patterns that contain neither the null byte nor a slash.* Thus, you can trivially create a file named "été" using Latin-1 encoding, and again using UTF-8 encoding, and again using UTF-7 encoding, and these might all be shown to the user as "identically" named, but I guarantee you that Pharo will not act sanely with all of these. Even on Windows, where things are a bit saner (NTFS mandates UTF-16), and where an explicit normalization form is preferred (NFC), I just explicitly verified that I can trivially inject other normalization forms into the file system. Thus, you can still have two files named "été" that nevertheless have different names as far as the OS is concerned.
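
To make the encoding point concrete, here is a rough shell sketch, again assuming a POSIX shell, mktemp and od, with illustrative variable names. It creates "été" once as Latin-1 bytes and once as UTF-8 bytes, then dumps the raw bytes of every name in the directory. Some file systems refuse the Latin-1 name because it is not valid UTF-8, so that step is allowed to fail:

```shell
#!/bin/sh
LC_ALL=C; export LC_ALL                # treat names as plain bytes while globbing

latin1=$(printf '\351t\351')           # "été" in Latin-1: e9 74 e9
utf8=$(printf '\303\251t\303\251')     # "été" in UTF-8:   c3 a9 74 c3 a9

dir=$(mktemp -d)                       # scratch directory (assumption)
touch "$dir/$utf8"
touch "$dir/$latin1" || echo "this file system rejected the Latin-1 name"

# Hex-dump each file name so the difference is visible regardless of
# how a terminal would render it.
dump=$(for f in "$dir"/*; do
    printf '%s' "${f##*/}" | od -An -tx1 | tr -s ' '
done)
printf '%s\n' "$dump"
rm -rf "$dir"
```

Where both names are accepted, the directory holds two distinct files whose names a terminal in the matching encoding would each display as "été".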

In this case, as far as I can tell, Pharo assumes that all path names are Unicode, and does not do any work to convert strings to or from the various normalization schemes (looking in Path class>>canonicalizeElements:, Path class>>from:delimiter:, and FileSystemStore>>pathFromString: here).


There's therefore a pretty straightforward fix that Pharo could do:

  1. Path would use ByteArrays as the actual canonical store, and
     provide convenience methods to see what the array decodes to
     in various encodings.  The developer and application can make
     decisions about what encoding system they want to use.
  2. The VM likely needs to be modified to handle this (I didn't check).

As much as I wish Hilaire provided more details in his bug report, it's worth keeping in mind that not all users, or even all programmers, understand the full implications of things like how various Unicode normalization and encoding schemes interact in practice with Unix's very vague concept of what a file name actually is, so I usually try to approach these bug reports carefully and with an open mind.


--Benjamin

* On OS X, HFS+ uses UTF-16 with an Apple-specific variant of NFD, whereas I do not believe this holds for e.g. UFS or FUSE-backed file systems, so things are a bit subtler there, but the general rule holds.
