On Tue, Mar 29, 2011 at 10:55:47PM +0200, Victor Stinner wrote:
> On Tuesday, 29 March 2011 at 22:40 +0200, Lennart Regebro wrote:
> > The lesson here seems to be "if you have to use blacklists, and you
> > use unicode strings for those blacklists, also make sure the string
> > you compare with doesn't have surrogates".
>
> No. '\u4f60\u597d'.encode('big5').decode('latin1') gives '§A¦n' which
> doesn't contain any surrogate character.
>
> The lesson is: if you compare Unicode filenames on UNIX, make sure that
> your system is correctly configured (the locale encoding must be the
> filesystem encoding).
>

You're both wrong :-)
Lennart is missing that you just need to use the same encoding plus
surrogateescape (or stick with bytes) when decoding the byte strings that
you are comparing.

Victor, you're missing that on UNIX there is no filesystem encoding, so
the idea that the locale and filesystem encodings must match is false (and
unnecessary): the encodings that you use within Python just need to be the
same as each other. They don't even need to match what's actually used on
the filesystem or in the user's locale.

-Toshio
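
A minimal sketch of that point, assuming Python 3. The helper name
`to_text` and the choice of UTF-8 as the working encoding are hypothetical;
any single encoding works as long as both sides of the comparison use it
together with the surrogateescape error handler.

```python
# Bytes as a filename might arrive from the OS: the Big5 encoding of
# '\u4f60\u597d', the string used in the quoted example.
raw_name = '\u4f60\u597d'.encode('big5')

# The quoted example: decoding with latin1 produces no surrogates at all.
as_latin1 = raw_name.decode('latin1')


def to_text(name, encoding='utf-8'):
    """Decode filename bytes with one fixed encoding plus surrogateescape.

    Hypothetical helper: the encoding does not have to be the one the
    filename was really written in, it only has to be applied consistently.
    """
    return name.decode(encoding, 'surrogateescape')


# Build the blacklist and decode the candidate the same way.
blacklist = {to_text(raw_name)}
candidate = to_text(raw_name)
matches = candidate in blacklist

# Undecodable bytes are kept as lone surrogates, so the round trip is lossless.
round_trip = candidate.encode('utf-8', 'surrogateescape')

# Mixing encodings is what breaks the comparison, not the surrogates.
mixed_matches = to_text(raw_name, 'latin1') in blacklist
```

Here `matches` is true and `round_trip` equals `raw_name`, even though
UTF-8 is the "wrong" encoding for these bytes, while `mixed_matches` is
false because the two sides were decoded differently. Comparing the raw
bytes directly avoids the question altogether.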
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com