On Tue, Jan 25, 2011 at 10:22:41AM +0100, Xavier Morel wrote:
> On 2011-01-25, at 04:26 , Toshio Kuratomi wrote:
> >
> > * If you can pick a set of encodings that are valid (utf-8 for Linux and
> >   MacOS
>
> HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right
> here you've already broken Python modules on OSX.
>
Others have been saying that Mac OSX's HFS+ uses UTF-8.  But the question is
not whether UTF-16 or UTF-8 is used by HFS+.  It's whether you can sensibly
decide on an encoding from the type of system that is being run on.  This
could be querying the filesystem or a check on sys.platform or some other
method.  I don't know what detection the current code does.
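
To make the sys.platform option concrete, here is a rough sketch of what such
a check could look like.  The function name is hypothetical and this is not
what the current import code does; the utf-8 choice for Linux and MacOS is
just the assumption from the proposal quoted above, which is itself disputed
for HFS+ in this thread.

```python
import sys


def guess_module_filename_encoding(platform=sys.platform):
    """Hypothetical helper: pick an encoding from the platform name alone."""
    if platform.startswith('linux') or platform == 'darwin':
        # Assumption from the proposal under discussion: mandate utf-8
        # here instead of trusting the user's locale settings.
        return 'utf-8'
    # Everywhere else, fall back to what the interpreter already reports.
    return sys.getfilesystemencoding()


print(guess_module_filename_encoding())
```

Querying the filesystem instead would need more than this; the sketch only
covers the simplest of the detection methods mentioned.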

On Linux there's no defined encoding that will work; file names are just
bytes to the Linux kernel so based on people's argument that the convention
is and should be that filenames are utf-8 and anything else is
a misconfigured system -- python should mandate that its module filenames on
Linux are utf-8 rather than using the user's locale settings.

>
> And as far as I know, Linux software/FS generally use NFC (I've already seen
> this issue cause trouble)
>
Linux FS's are bytes with a small blacklist (so you can't use the NULL byte
in a filename, for instance).  Linux software would be free to use any
normal form that they want.  If one software used NFC and another used NFD,
the FS would record two separate files with two separate filenames.  Other
programs might or might not display this correctly.  Example:

<zsh>$ touch cafe
<zsh>$ python
Python 2.7 (r27:82500, Sep 16 2010, 18:02:00)
>>> import os
>>> import unicodedata
>>> a=u'café'
>>> b=unicodedata.normalize('NFC', a)
>>> c=unicodedata.normalize('NFD', a)
>>> open(b.encode('utf8'), 'w').close()
>>> open(c.encode('utf8'), 'w').close()
>>> os.listdir(u'.')
[u'people-etc-changes.txt', u'cafe\u0301', u'cafe',
 u'people-etc-changes.sha256sum', u'caf\xe9']
>>> os.listdir('.')
['people-etc-changes.txt', 'cafe\xcc\x81', 'cafe',
 'people-etc-changes.sha256sum', 'caf\xc3\xa9']
>>> ^D
<zsh>$ ls -al .
drwxrwxr-x.  2 badger badger 4096 Jan 25 07:46 .
drwxr-xr-x. 17 badger badger 4096 Jan 24 18:27 ..
-rw-rw-r--.  1 badger badger    0 Jan 25 07:45 cafe
-rw-rw-r--.  1 badger badger    0 Jan 25 07:46 cafe
-rw-rw-r--.  1 badger badger    0 Jan 25 07:46 café
<zsh>$ ls -al cafe
-rw-rw-r--.  1 badger badger    0 Jan 25 07:45 cafe
<zsh>$ ls -al cafe?
-rw-rw-r--.  1 badger badger    0 Jan 25 07:46 cafe

Now in this case, the decomposed form of the filename is being displayed
incorrectly and the shell treats the decomposed character as two characters
instead of one.  However, when you view these files in dolphin (the KDE file
manager) you properly see café repeated twice.

-Toshio
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com