Serhiy Storchaka added the comment:

> I'm not sure that multibyte encodings other than UTF-8 are used in the world.

I don't use any of them, but I have heard that some are still widely used.

This issue was provoked by issue13612. See also related issue15877.

> pyexpat_encoding_create() looks like an heuristic. How many multibyte codecs 
> can be used with your patch?

All codecs that expat can support. The expat documentation lists these
requirements for a user-supplied encoding:

   1. Every ASCII character that can appear in a well-formed XML document,
      other than the characters

      $@\^`{}~

      must be represented by a single byte, and that byte must be the
      same byte that represents that character in ASCII.

   2. No character may require more than 4 bytes to encode.

   3. All characters encoded must have Unicode scalar values <=
      0xFFFF, (i.e., characters that would be encoded by surrogates in
      UTF-16 are  not allowed).  Note that this restriction doesn't
      apply to the built-in support for UTF-8 and UTF-16.

   4. No Unicode character may be encoded by more than one distinct
      sequence of bytes.

14 Python encodings satisfy these criteria: big5, big5hkscs, cp932, cp949, 
cp950, euc-jp, euc-jis-2004, euc-jisx0213, gb2312, gbk, johab, shift-jis, 
shift-jis-2004, shift-jisx0213.
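
As a rough illustration of criterion 1 only, the check below tests whether a 
Python codec keeps the required ASCII characters as single, identical bytes. 
It is a sketch, not the code from either patch: the helper name 
ascii_compatible is hypothetical, the exempt set is taken from the expat.h 
comment as I recall it, and criteria 2-4 are not checked at all.

```python
# Characters exempt from criterion 1 (assumed from the expat.h comment).
EXEMPT = set('$@\\^`{}~')

# ASCII characters allowed by the XML 1.0 Char production:
# tab, LF, CR, and 0x20-0x7F.
XML_ASCII = [chr(c) for c in (0x09, 0x0A, 0x0D, *range(0x20, 0x80))]


def ascii_compatible(encoding):
    """Return True if every non-exempt XML ASCII character is encoded as
    the same single byte it has in ASCII, and decodes back from it."""
    for ch in XML_ASCII:
        if ch in EXEMPT:
            continue
        raw = bytes([ord(ch)])
        try:
            if ch.encode(encoding) != raw or raw.decode(encoding) != ch:
                return False
        except UnicodeError:
            # The codec cannot represent the character or the byte at all.
            return False
    return True
```

A codec that passes this check may still fail the other criteria, so it is 
only a first filter.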

> A whitelist of multibyte codecs may be less reliable. What do you think?

pyexpat_multibyte_encodings_4.patch takes this approach. It hardcodes a list of 
supported encodings with the minimal required tables.

pyexpat_multibyte_encodings_5.patch supports any encoding that satisfies expat's 
criteria and builds all needed data on first access (tens of kilobytes). After 
the heavy start it works much faster than the previous patch.
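
To show the "build on first access" idea in miniature, here is a Python 
sketch that derives a first-byte table of the kind expat's unknown-encoding 
interface expects (a code point for a single-byte character, -1 for an 
invalid byte, -n for the lead byte of an n-byte sequence) and caches it per 
encoding. It is not the patch's C code: build_map and its helpers are 
hypothetical names, and the sketch assumes sequences of at most 2 bytes, so 
it never produces -3 or -4.

```python
from functools import lru_cache


def _decodes_to_one_char(raw, encoding):
    """Return True if raw decodes strictly to exactly one character."""
    try:
        return len(raw.decode(encoding)) == 1
    except UnicodeError:
        return False


def _classify(lead, encoding):
    # A byte that decodes on its own maps directly to its code point.
    if _decodes_to_one_char(lead, encoding):
        return ord(lead.decode(encoding))
    # Otherwise probe every possible second byte; one hit is enough to
    # mark this byte as the lead of a 2-byte sequence.
    for b2 in range(256):
        if _decodes_to_one_char(lead + bytes([b2]), encoding):
            return -2
    # Not valid alone and not a usable lead byte.
    return -1


@lru_cache(maxsize=None)
def build_map(encoding):
    """Build the 256-entry table once; later calls return the cached tuple."""
    return tuple(_classify(bytes([b]), encoding) for b in range(256))
```

The first call for an encoding pays for all the probing, and every later 
call is a cache lookup, which is the same trade-off as in the patch: a slow 
start in exchange for fast reuse.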

