Kang-Hao (Kenny) Lu <[email protected]> added the comment:
Attached patch does the following beyond what the patch from haypo does:
* call the error handler
* reject 0xd800~0xdfff when decoding utf-32
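
To make the second point concrete, here is a minimal pure-Python sketch of the check, not the C code from the patch. The helper name decode_utf32_le_strict is hypothetical, and it raises a plain ValueError where the real codec would call the error handler.

```python
import struct


def decode_utf32_le_strict(data: bytes) -> str:
    """Hypothetical helper: decode UTF-32-LE and reject surrogate code points."""
    if len(data) % 4:
        raise ValueError("truncated UTF-32 data")
    chars = []
    for offset in range(0, len(data), 4):
        # Each UTF-32 code unit is one little-endian 32-bit integer.
        (code_point,) = struct.unpack_from("<I", data, offset)
        if 0xD800 <= code_point <= 0xDFFF:
            # Surrogates are reserved for UTF-16 and are not valid scalar values.
            raise ValueError(
                f"surrogate U+{code_point:04X} at byte offset {offset}"
            )
        if code_point > 0x10FFFF:
            raise ValueError(f"code point out of range at byte offset {offset}")
        chars.append(chr(code_point))
    return "".join(chars)
```

With this sketch, b"A\x00\x00\x00" decodes to "A", while b"\x00\xd8\x00\x00" (U+D800) is rejected.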
The following items are on my TODO list, although this patch doesn't depend on
any of them and can be reviewed and landed separately:
* make the surrogatepass error handler work for utf-16 and utf-32. (I should
be able to finish this by today)
* fix an error in the error handler for utf-16-le. (In Python 3.2,
b'\xdc\x80\x00\x41'.decode('utf-16-be', 'ignore') returns "\x00" instead of "A"
for some reason)
* make unicode_encode_call_errorhandler return bytes so that we can simplify
this patch. (This arguably belongs to a separate bug so I'll file it when
needed)
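
The behaviour the first two TODO items aim for can be sketched as follows. This assumes an interpreter where those items have landed (CPython 3.4 or later, as far as I know); on Python 3.2 the results differ, as described above.

```python
# Assumption: a CPython release where surrogatepass works for utf-16/utf-32.
lone = "\udc80"

# surrogatepass lets the lone surrogate through as a plain code unit.
utf16_bytes = lone.encode("utf-16-le", "surrogatepass")
utf32_bytes = lone.encode("utf-32-le", "surrogatepass")
round_trip = utf16_bytes.decode("utf-16-le", "surrogatepass")

# In strict mode the same string is rejected by the encoder.
try:
    lone.encode("utf-16-le")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# 'ignore' should skip only the two bytes of the lone surrogate (DC 80)
# and still decode the following code unit (00 41) as "A".
ignored = b"\xdc\x80\x00\x41".decode("utf-16-be", "ignore")
```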
> All UTF codecs should reject lone surrogates in strict error mode,
Should we really reject lone surrogates for UTF-7? There's a test in
test_codecs.py that checks that "\udc80" is encoded as b"+3IA-". Given that
UTF-7 is not really part of the Unicode Standard and is more like a "data
encoding" than a "text encoding" to me, I am not sure it's a good idea.
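
For reference, the b"+3IA-" value can be reproduced by hand, which shows why UTF-7 looks like a "data encoding": it is just base64 over the UTF-16-BE code units. This is a sketch; the final line assumes the utf-7 codec keeps accepting lone surrogates, as the existing test expects.

```python
import base64
import struct

lone = "\udc80"

# UTF-7 shifts into modified base64 over the big-endian UTF-16 code units.
utf16_be_unit = struct.pack(">H", ord(lone))  # b"\xdc\x80"
# Modified base64 drops the "=" padding; "+" opens and "-" closes the shift.
manual = b"+" + base64.b64encode(utf16_be_unit).rstrip(b"=") + b"-"

# Assumption: the codec does not reject the lone surrogate.
from_codec = lone.encode("utf-7")
```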
> but let them pass using the surrogatepass error handler (the UTF-8
> codec already does) and apply the usual error handling for ignore
> and replace.
For 'replace', the patch now emits b"\x00?" instead of b"?" so that a UTF-16
stream doesn't get corrupted. This is not "usual" and does not match
# Implements the ``replace`` error handling: malformed data is replaced
# with a suitable replacement character such as ``'?'`` in bytestrings
# and ``'\ufffd'`` in Unicode strings.
in the documentation. What do we do? Are there other encodings that are not
ASCII compatible besides UTF-7, UTF-16 and UTF-32 that Python supports? I think
it would be better to use encoded U+fffd whenever possible and fall back to
'?'. What do you think?
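
A rough sketch of the fallback suggested here, written in Python rather than in the codec's C code. The helper name replacement_bytes is hypothetical, and it should be given an explicit-endian codec name, since plain 'utf-16' would also prepend a BOM.

```python
def replacement_bytes(encoding: str) -> bytes:
    """Hypothetical helper: pick the bytes the 'replace' handler would emit."""
    try:
        # Prefer U+FFFD when the target encoding can represent it.
        return "\ufffd".encode(encoding)
    except UnicodeEncodeError:
        # Otherwise fall back to '?', encoded in the target encoding so that
        # multi-byte code units stay aligned.
        return "?".encode(encoding)


# A bare b"?" would leave a UTF-16 stream with an odd number of bytes.
question_mark_utf16 = "?".encode("utf-16-be")
```

Under this scheme utf-16-be gets b"\xff\xfd", while latin-1 and ascii, which cannot represent U+FFFD, still get b"?".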
Some other self comments on my patch:
* In the STORECHAR macro for utf-16 and utf-32, I changed all instances of "ch
& 0xFF" to (unsigned char) ch. I don't have enough C knowledge to know if this
is actually better or if this makes any difference at all.
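
As a sanity check on the first point: as far as I understand, converting to unsigned char in C reduces the value modulo 2**CHAR_BIT, so with 8-bit chars both spellings should give the same byte. The sketch below compares them from Python, using ctypes.c_ubyte (which does no overflow checking) as a stand-in for the C cast. The helper names are hypothetical, and 8-bit chars are an assumption.

```python
import ctypes


def low_byte_by_mask(ch: int) -> int:
    """Mirror of the original spelling: ch & 0xFF."""
    return ch & 0xFF


def low_byte_by_cast(ch: int) -> int:
    """Stand-in for (unsigned char) ch; c_ubyte truncates without checking."""
    return ctypes.c_ubyte(ch).value


# Compare both spellings on a few code points and code-unit values.
samples = (0x41, 0xFF, 0x100, 0x1234, 0xDC80, 0x10FFFF)
all_equal = all(low_byte_by_mask(ch) == low_byte_by_cast(ch) for ch in samples)
```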
* The code for utf-16 and utf-32 duplicates the utf-8 code, whose complexity
comes from issue #8092. I'm not sure if there are ways to simplify it. For
example, are there suitable functions so that we don't need to check for
integer overflow at these places?
----------
nosy: +kennyluck
Added file: http://bugs.python.org/file24368/utf-16&32_reject_surrogates.patch
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12892>
_______________________________________