"Martin v. Löwis" <[EMAIL PROTECTED]> writes:
> Marcin 'Qrczak' Kowalczyk schrieb:
>> It is true that it can change the interpretation of file contents.
>> This is unavoidable. Unless someone uses unpaired surrogates for this
>> purpose (or code points above U+10FFFF) - I've seen such proposals,
>> but IMHO they are abusing rules too far.
>
> It's not exactly unavoidable: any escaping mechanism can support the
> full range of valid input. In your escaping mechanism, you could
> duplicate 0 bytes on decoding, and write a null byte if you have two
> subsequent NUL characters on encoding.
This is exactly what I am doing. The encoding is able to decode
arbitrary byte sequences, including '\0' bytes, and encodes them back
losslessly.
The point is that it differs from true UTF-8 for strings which contain
'\0' or U+0000. It's unavoidable that it differs from UTF-8 for some
strings, unless code points not encodable in UTF-8 are used.
It doesn't differ from true UTF-8 when there is no '\0' or U+0000.
Because it doesn't differ from UTF-8 for such strings, the escaping
fires for them only where a UTF-8 decoder would have reported an error.
In other words, it only changes the behavior of code which would fail
otherwise; it doesn't break anything that would work in UTF-8.
My encoder is injective: it accepts U+0000 prefixes only in sequences
which would have been invalid UTF-8.
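
A minimal Python sketch of a scheme with these properties, for
illustration only. It is not the actual implementation: the function
names are hypothetical, and representing an undecodable byte b as U+0000
followed by chr(b) is an assumption, since the escape form is not
spelled out above.

```python
def decode_escaped(data: bytes) -> str:
    """Decode UTF-8, doubling NUL and escaping undecodable bytes."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == 0:
            # A real NUL byte is doubled so it cannot look like an escape.
            out.append("\0\0")
            i += 1
            continue
        # Try the shortest strictly valid UTF-8 sequence starting at i.
        for length in (1, 2, 3, 4):
            try:
                char = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(char)
            i += length
            break
        else:
            # Assumed escape form: U+0000 followed by the raw byte value.
            out.append("\0" + chr(data[i]))
            i += 1
    return "".join(out)


def encode_escaped(text: str) -> bytes:
    """Inverse of decode_escaped; rejects strings it could not have produced."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\0":
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling U+0000 at end of string")
        nxt = text[i + 1]
        if nxt == "\0":
            out.append(0)            # doubled NUL becomes one NUL byte
        elif 0x80 <= ord(nxt) <= 0xFF:
            out.append(ord(nxt))     # escaped raw byte
        else:
            raise ValueError("invalid escape after U+0000")
        i += 2
    result = bytes(out)
    # Injectivity check: escapes that spell valid UTF-8 are refused,
    # because decoding the result would not give the same string back.
    if decode_escaped(result) != text:
        raise ValueError("escape sequence forms valid UTF-8")
    return result
```

In this sketch the final check in encode_escaped is what keeps the
encoder injective: a string such as "\0\xc3\0\xa9" is refused because
its bytes would also be the plain UTF-8 encoding of another string.
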
I agree that it's not suitable for showing the filename to a user.
> I still think that PUA characters would be a better use
What if the filename contains the correct UTF-8 encoding of such a PUA
character?
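
To make the question concrete, here is a small Python sketch of the
collision. It assumes a hypothetical PUA scheme that maps an undecodable
byte b to the code point U+E000 + b; the name decode_pua and the base
PUA_BASE are made up for this example and are not taken from any
proposal.

```python
PUA_BASE = 0xE000  # hypothetical base of the escape range


def decode_pua(data: bytes) -> str:
    """Decode UTF-8, mapping undecodable bytes into the Private Use Area."""
    out = []
    i = 0
    while i < len(data):
        # Try the shortest strictly valid UTF-8 sequence starting at i.
        for length in (1, 2, 3, 4):
            try:
                char = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(char)
            i += length
            break
        else:
            # Undecodable byte: escape it as a PUA character.
            out.append(chr(PUA_BASE + data[i]))
            i += 1
    return "".join(out)


invalid_name = b"\xff"                       # not valid UTF-8
valid_name = chr(0xE0FF).encode("utf-8")     # valid UTF-8 for U+E0FF

# Two different filenames decode to the same string, so no encoder
# can map that string back to both of them.
collision = decode_pua(invalid_name) == decode_pua(valid_name)
```

Under these assumptions collision is True while the two byte strings
differ, so such a scheme would need an extra rule for PUA characters
that are already present in the input.
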
--
__("< Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/