Serhiy Storchaka added the comment:
> I'm not sure that multibyte encodings other than UTF-8 are used in the world.
I don't use any of them, but I have heard that some are still widely used.
This issue was prompted by issue13612. See also the related issue15877.
> pyexpat_encoding_create() looks like an heuristic. How many multibyte codecs
> can be used with your patch?
All codecs that expat can support. Expat sets these requirements:
"""
1. Every ASCII character that can appear in a well-formed XML document,
other than the characters
$@\^`{}~
must be represented by a single byte, and that byte must be the
same byte that represents that character in ASCII.
2. No character may require more than 4 bytes to encode.
3. All characters encoded must have Unicode scalar values <=
0xFFFF, (i.e., characters that would be encoded by surrogates in
UTF-16 are not allowed). Note that this restriction doesn't
apply to the built-in support for UTF-8 and UTF-16.
4. No Unicode character may be encoded by more than one distinct
sequence of bytes.
"""
Fourteen Python encodings satisfy these criteria: big5, big5hkscs, cp932, cp949,
cp950, euc-jp, euc-jis-2004, euc-jisx0213, gb2312, gbk, johab, shift-jis,
shift-jis-2004, shift-jisx0213.
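
As a rough illustration of criterion 1 only, here is a minimal sketch that
probes a Python codec from pure Python. The function name
satisfies_ascii_rule and the EXEMPT set are made up for this example and are
not part of either patch; criteria 2-4 are not checked here.

```python
# Characters that criterion 1 explicitly exempts.
EXEMPT = set('$@\\^`{}~')


def satisfies_ascii_rule(encoding):
    """Return True if every non-exempt ASCII character that is legal in XML
    encodes to the same single byte it has in ASCII (criterion 1 only)."""
    # ASCII characters allowed in XML 1.0: tab, LF, CR and 0x20..0x7F.
    xml_ascii = [chr(c) for c in (0x09, 0x0A, 0x0D, *range(0x20, 0x80))]
    for ch in xml_ascii:
        if ch in EXEMPT:
            continue
        try:
            if ch.encode(encoding) != ch.encode('ascii'):
                return False
        except UnicodeEncodeError:
            # The codec cannot represent this ASCII character at all.
            return False
    return True


if __name__ == '__main__':
    for name in ('gbk', 'latin-1', 'cp500', 'utf-16'):
        print(name, satisfies_ascii_rule(name))
```

An EBCDIC codec such as cp500 fails this check on the very first character,
while ASCII-compatible codecs pass it.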
> A whitelist of multibyte codecs may be less reliable. What do you think?
pyexpat_multibyte_encodings_4.patch takes this approach. It hardcodes a list of
supported encodings with the minimal required tables.
pyexpat_multibyte_encodings_5.patch supports any encoding that satisfies the
expat criteria and builds all the needed data on first access (tens of
kilobytes). After a heavy start it works much faster than the previous patch.
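
The "build on first access, then reuse" idea can be sketched in Python as
follows. This is only an illustration under stated assumptions, not the code
from the patch (which is written in C): byte_table, LEAD and INVALID are
hypothetical names, and the sketch only classifies each byte value as a
single-byte character, a byte that starts a longer sequence, or a byte the
decoder rejects. It does not compute sequence lengths, and some codecs only
reject a bad lead byte once the following byte arrives, so LEAD means "the
decoder wants more input", nothing stronger.

```python
import codecs
from functools import lru_cache

LEAD = None     # decoder needs more bytes (sequence length not computed here)
INVALID = -1    # decoder rejects this byte outright


@lru_cache(maxsize=None)
def byte_table(encoding):
    """Classify all 256 byte values once per encoding.

    The first call for an encoding does the probing; later calls return
    the cached tuple."""
    make_decoder = codecs.getincrementaldecoder(encoding)
    table = []
    for b in range(256):
        decoder = make_decoder()  # fresh decoder state for every probe
        try:
            text = decoder.decode(bytes([b]), final=False)
        except UnicodeDecodeError:
            table.append(INVALID)
            continue
        # One decoded character: single-byte code point; otherwise pending.
        table.append(ord(text) if len(text) == 1 else LEAD)
    return tuple(table)


if __name__ == '__main__':
    table = byte_table('gbk')
    print(table[0x41], table[0x81])
```

The cache trades a slower first lookup for cheap lookups afterwards, which is
the same trade-off described above.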
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue18059>
_______________________________________