Hi,

On Thursday 13 September 2007 18:22:12 Marcin 'Qrczak' Kowalczyk wrote:
> What should happen when a command line argument or an environment
> variable is not decodable using the system encoding (on Unix where
> from the OS point of view it is an array of bytes)?

On Linux, filenames are *byte* strings, not *character* strings. I always
have this problem with Python 2.x: I converted the filename (argv[x]) to
Unicode to be able to format error messages in full Unicode... but it's not
possible. Linux allows invalid UTF-8 filenames even on a full UTF-8
installation (Ubuntu); see Marcin's examples.

So I propose to keep sys.argv as an array of byte strings. If you try to
create Unicode strings, you will be unable to write a program that converts
a filesystem with "broken" filenames (see the convmv program, for example)
or opens a file with a "broken" filename (broken: invalid byte sequence for
the UTF/JIS/Big5/... charset).

---

For Python 2.x, my solution is to keep byte strings for I/O and use Unicode
strings for error messages. Function to convert any byte string (filename
string) to Unicode:

from hachoir_core.i18n import getTerminalCharset
from hachoir_core.tools import makePrintable

def unicodeFilename(filename, charset=None):
    # Fall back to the terminal charset when none is given
    if not charset:
        charset = getTerminalCharset()
    try:
        return unicode(filename, charset)
    except UnicodeDecodeError:
        # Invalid byte sequence: escape it instead of failing
        return makePrintable(filename, charset, to_unicode=True)

makePrintable() replaces invalid byte sequences with escape strings.
Example:

>>> from hachoir_core.tools import makePrintable
>>> makePrintable("a\x80", "utf8", to_unicode=True)
u'a\\x80'
>>> print makePrintable("a\x80", "utf8", to_unicode=True)
a\x80

Source code of makePrintable():
http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/tools.py#L225

Source code of getTerminalCharset():
http://hachoir.org/browser/trunk/hachoir-core/hachoir_core/i18n.py#L23

Victor Stinner
http://hachoir.org/
_______________________________________________
Python-3000 mailing list
Python-3000@python.org
http://mail.python.org/mailman/listinfo/python-3000
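
The "bytes for I/O, escaped Unicode for messages" idea can also be sketched
without hachoir. The snippet below is only an illustration, not the hachoir
implementation: the name printable_filename is hypothetical, the default
charset is an assumption, and it relies on the standard "backslashreplace"
error handler, which supports decoding only in Python 3.5 and later.

```python
def printable_filename(filename, charset="utf-8"):
    """Decode a byte-string filename for display purposes only.

    Hypothetical helper: bytes that are invalid in `charset` are
    rendered as \\xNN escapes instead of raising UnicodeDecodeError.
    """
    return filename.decode(charset, "backslashreplace")


raw = b"a\x80"                    # not valid UTF-8
shown = printable_filename(raw)   # text that is safe to put in a message
print(shown)                      # prints: a\x80
```

The original byte string (raw above) is still what gets passed to open() or
os.rename(); the decoded text is used only in messages, because the escaped
form cannot be reliably converted back to the original name.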