Adam Olsen wrote:
> Lossy conversion just moves around what gets treated as garbage. As
> all valid unicode scalars can be round tripped, there's no way to
> create a valid unicode file name without being lossy. The alternative
> is not be valid unicode, but since we can't use such objects with
> external libs, can't even print them, we might as well call them
> something else. We already have a name for that: bytes.

To my mind, there are two kinds of app in the world when it comes to
file paths:

1) "Normal" apps (e.g. a word processor), that are only interested in
files with sane, well-formed file names that can be properly decoded to
Unicode with the filesystem encoding identified by Python. If there is
invalid data on the filesystem, they don't care and don't want to see
it or have to deal with it.

2) "Filesystem" apps (e.g. a filesystem explorer), that need to be able
to deal with malformed filenames that may not decode properly using the
identified filesystem encoding.

For the former category of apps, the presence of a malformed filename
should NOT disrupt the processing of well-formed files and directories.
Those applications should "just work", even if the underlying
filesystem has a few broken filenames.

The latter category of applications needs some way of defining its own
application-specific handling of malformed names. That screams
"callback" to me - and one mechanism to achieve that would be to expose
the unicode "errors" argument for filesystem operations that return
file paths (e.g. os.getcwd(), os.listdir(), os.readlink(), os.walk()).

Once that was exposed, the existing error handling machinery in the
codecs module could be used to allow applications to define their own
custom error handling for Unicode decode errors in these operations
(e.g. call "codecs.register_error('bad_filepath',
handle_filepath_error)", then use "errors='bad_filepath'" in the
relevant os API calls).

The default handling could be left at "strict", with os.listdir() and
os.walk() specifically ignoring path entries that trigger
UnicodeDecodeError. getcwd() and readlink() could just propagate the
exception, since they have no other information to return.

Cheers,
Nick.

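A rough sketch of the callback idea described above, written for Python 3. The "errors" argument on the os functions is only a proposal in this message, so the sketch does not call any such API. Instead it uses a hypothetical helper, decode_names(), that applies the same rules to raw byte names. handle_filepath_error is the illustrative handler name from the message, and its escaping policy (showing each bad byte as \xNN) is an assumption, not part of the proposal.

```python
import codecs


def handle_filepath_error(exc):
    """Illustrative handler: show each undecodable byte as a \\xNN escape."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("\\x%02x" % byte for byte in bad)
    # Return the replacement text and the position at which to resume decoding.
    return replacement, exc.end


# Register the handler under the name used in the message.
codecs.register_error("bad_filepath", handle_filepath_error)


def decode_names(raw_names, encoding="utf-8", errors="strict"):
    """Hypothetical helper: decode raw directory entries to str.

    Entries that still raise UnicodeDecodeError (for example under the
    default "strict" handling) are skipped, mirroring the suggested
    behaviour for os.listdir() and os.walk().
    """
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode(encoding, errors))
        except UnicodeDecodeError:
            continue
    return names


raw = [b"ok.txt", b"caf\xe9.txt"]  # second name is Latin-1, not valid UTF-8
normal_view = decode_names(raw)                            # "normal" app
explorer_view = decode_names(raw, errors="bad_filepath")   # "filesystem" app
```

With "strict", the malformed entry is silently dropped and the well-formed one is processed as usual. With the registered handler, the application decides what a malformed name turns into. On a real filesystem the raw names could come from os.listdir() called with a bytes path, which returns bytes entries.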
--
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---------------------------------------------------------------
            http://www.boredomandlaziness.org