[issue18059] Add multibyte encoding support to pyexpat

Stefan Behnel Fri, 13 Sep 2013 22:26:59 -0700

Stefan Behnel added the comment:

I don't think I have my head deep enough in the encodings implementation to say 
that this is the correct/best way to do it, but the patch looks mostly 
reasonable to me and would be a helpful addition.


I have two comments on the pyexpat_encoding_convert() function:

1) I can't see a safeguard against reading beyond the end of the data buffer. 
What if s already points to the last byte and we try to read two or three bytes 
to decode them? I wouldn't be surprised if this kind of input could be crafted 
deliberately.

2) Creating a throw-away Unicode object through a named decoder looks like a 
huge overhead for decoding two bytes. It might be considered an optimisation to 
change that, but if you are really trying to parse a longer XML document with 
lots of Japanese text in it (i.e. if you actually *need* this feature), it will 
most likely end up being way too slow to make any real use of it.

I think that both points should be addressed before this gets added.
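
To make the two points concrete, here is a minimal sketch in plain C. It is 
not taken from the patch: convert_state, convert_two_bytes(), toy_decode() and 
DECODE_ERROR are hypothetical names, it only handles two-byte sequences, and it 
assumes the caller can pass an end pointer, which the real callback may not 
receive. The length check covers point 1, and the lazily filled lookup table 
covers point 2 by calling the expensive decoder at most once per byte pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DECODE_ERROR (-1)

/* Assumed slow path, e.g. a call into a named codec (hypothetical). */
typedef int32_t (*slow_decode_fn)(unsigned char lead, unsigned char trail);

typedef struct {
    slow_decode_fn slow;
    /* 0 = not cached yet, -1 = known invalid, otherwise code point + 1.
       256 * 256 * 4 bytes = 256 KiB, so allocate it statically or on the heap. */
    int32_t cache[256][256];
} convert_state;

/* Toy stand-in for the expensive decoder; it counts how often it is called. */
static int toy_decode_calls = 0;

static int32_t toy_decode(unsigned char lead, unsigned char trail)
{
    toy_decode_calls++;
    if (lead < 0x80)
        return DECODE_ERROR;              /* not a lead byte in this toy scheme */
    return ((int32_t)(lead & 0x7f) << 8) | trail;
}

/* Decode the two-byte sequence at s, never reading at or beyond end. */
static int32_t convert_two_bytes(convert_state *st,
                                 const unsigned char *s,
                                 const unsigned char *end)
{
    int32_t *slot;

    assert(st != NULL && s != NULL && end != NULL);

    /* Point 1: refuse truncated input instead of reading past the buffer. */
    if (end - s < 2)
        return DECODE_ERROR;

    /* Point 2: only hit the slow decoder the first time a pair is seen. */
    slot = &st->cache[s[0]][s[1]];
    if (*slot == 0) {
        int32_t cp = st->slow(s[0], s[1]);
        *slot = (cp < 0) ? DECODE_ERROR : cp + 1;
    }
    return (*slot < 0) ? DECODE_ERROR : *slot - 1;
}
```

A zero-initialised convert_state (for example one with static storage) starts 
with an empty cache, so the only setup needed is to assign the slow member. 
Whether a table of this size is acceptable per parser is a trade-off I have not 
measured.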

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18059>
_______________________________________