Hi,

I've noticed that the encoding of non-ASCII filenames can be inconsistent 
between platforms when using the built-in open() function to create files.

For example, on an Ubuntu 10.04.4 LTS box, the character u'ş' (u'\u015f') gets 
encoded as u'ş' (u's\u0327'). Note how the two characters look exactly the 
same but are encoded differently. The original character is a single code 
point (u'\u015f'), but the name that ends up on the file system is a 
combination of two code points: the letter 's' followed by a combining 
cedilla (u's\u0327'). (You can learn more about combining diacritics in [1].) 
On the Mac, however, the original encoding is always preserved.
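The difference between the two spellings is visible from Python itself, on any platform: they compare unequal as strings unless you normalize them with the standard unicodedata module (the precomposed form is NFC, the decomposed one NFD):

```python
import unicodedata

# U+015F LATIN SMALL LETTER S WITH CEDILLA (precomposed, NFC form)
nfc = u"\u015f"
# 's' followed by U+0327 COMBINING CEDILLA (decomposed, NFD form)
nfd = u"s\u0327"

print(nfc == nfd)                                # False: different code points
print(unicodedata.normalize("NFD", nfc) == nfd)  # True
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```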

This issue was also discussed in a blog post by Ned Batchelder [2]. One 
suggested approach is to normalize the filename; however, this could result 
in a loss of information (what if, for example, the original filename 
contained combining diacritics that we wanted to preserve?).
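For what it's worth, here is a sketch of the normalize-on-comparison workaround (find_file is just an illustrative name of mine, not a standard function). It matches a name regardless of which form the filesystem stored, but, as noted above, it cannot distinguish two names that differ only in normalization:

```python
import os
import unicodedata

def find_file(directory, target):
    """Look up `target` in `directory`, comparing NFC-normalized names.

    Finds the file however the filesystem chose to encode the name,
    at the cost of treating NFC and NFD spellings as identical.
    """
    wanted = unicodedata.normalize("NFC", target)
    for name in os.listdir(directory):
        if unicodedata.normalize("NFC", name) == wanted:
            return name  # the name as the filesystem actually stores it
    return None
```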

Ideally, the original encoding would be preserved. Is that possible, or is 
it completely out of Python's control?

Thanks a lot,

Julien

[1] http://en.wikipedia.org/wiki/Combining_diacritic#Unicode_ranges
[2] http://nedbatchelder.com/blog/201106/filenames_with_accents.html
-- 
http://mail.python.org/mailman/listinfo/python-list