Marcin 'Qrczak' Kowalczyk wrote:
> It is true that it can change the interpretation of file contents.
> This is unavoidable. Unless someone uses unpaired surrogates for this
> purpose (or code points above U+10FFFF) - I've seen such proposals,
> but IMHO they are abusing rules too far.

It's not exactly unavoidable: any escaping mechanism can support the
full range of valid input. In your escaping mechanism, you could double
each 0 byte on decoding (producing two NUL characters), and write a
single null byte on encoding whenever you see two consecutive NUL
characters.

I still think that private-use area (PUA) characters would be a better
choice: in your encoding, you get two characters of decoded text for
one byte of input; if people need to render the file name, this will be
confusing. With a PUA character, rendering will still produce mojibake,
but you will likely get one "box" of output for what the user thinks
should be one character.

Refining my last proposal: I think there should be a "pass-through"
error handler for codecs which decodes undecodable bytes into PUA
characters, and encodes otherwise unencodable characters from the PUA
range back into the corresponding bytes. This could sit on top of
existing codecs, and help to decode undecodable file names in a way
that round-trips.

Regards,
Martin
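
A minimal sketch of the "pass-through" error handler described above,
assuming Python 3's codecs.register_error interface. The handler name
"pua-passthrough" and the base code point PUA_BASE are hypothetical
choices made for illustration; the proposal does not fix either one.

```python
import codecs

# Hypothetical choice: byte 0xNN maps to the PUA character U+E0NN.
PUA_BASE = 0xE000


def pua_passthrough(exc):
    """Turn undecodable bytes into PUA characters, and PUA characters back into bytes."""
    if isinstance(exc, UnicodeDecodeError):
        # One PUA character per offending byte, so the length is preserved.
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(PUA_BASE + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            offset = ord(ch) - PUA_BASE
            if not 0 <= offset <= 0xFF:
                raise exc  # genuinely unencodable, not one of our escapes
            out.append(offset)
        # An encode error handler may return bytes, copied to the output as-is.
        return bytes(out), exc.end
    raise exc


codecs.register_error("pua-passthrough", pua_passthrough)

raw = b"caf\xe9.txt"  # a file name that is not valid ASCII
name = raw.decode("ascii", "pua-passthrough")
assert len(name) == len(raw)  # one character per input byte
assert name.encode("ascii", "pua-passthrough") == raw  # round-trips
```

The example uses the ASCII codec on purpose. The encoding half of an
error handler only runs for characters the codec cannot encode, and
UTF-8 can encode PUA code points natively, so for UTF-8 the reverse
mapping would need something beyond an error handler (for example a
wrapping codec).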