New submission from STINNER Victor <victor.stin...@haypocalc.com>:

The incremental encoders of stateful CJK codecs reset the codec at each call
to encode(), producing valid but overlong output:

>>> import codecs
>>> encoder = codecs.getincrementalencoder('hz')()
>>> encoder.encode('\u804a') + encoder.encode('\u804a')
b'~{AD~}~{AD~}'
>>> '\u804a\u804a'.encode('hz')
b'~{ADAD~}'
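
A small script along these lines can be used to check a given interpreter. It is
only a sketch: it assumes that an empty call with final=True is what flushes the
closing escape sequence, which is not stated above. On an affected build the
first comparison should print False, because every chunk carries its own
~{ ... ~} pair.

```python
import codecs

text = '\u804a\u804a'

# Feed the text one character at a time, as a stream writer might.
encoder = codecs.getincrementalencoder('hz')()
chunks = [encoder.encode(ch) for ch in text]
# Assumption: an empty final call flushes any pending closing escape sequence.
chunks.append(encoder.encode('', final=True))
incremental = b''.join(chunks)

# One-shot encoding of the same text, for comparison.
one_shot = text.encode('hz')

print(chunks)
print(incremental == one_shot)
# Whether or not the shift state is kept, the bytes must decode back.
print(incremental.decode('hz') == text)
```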

Affected multibyte encodings: HZ and all encodings of the ISO 2022 family (e.g. 
iso-2022-jp).

The attached patch fixes this issue. I don't like where I added the tests, and 
they may be moved somewhere else, but the HZ codec has no tests today (I 
opened issue #12057 for that), and the ISO 2022 codecs have no specific tests 
(test_multibytecodec is "Unit test for multibytecodec itself"). Maybe we should 
add tests specific to ISO 2022 first?

I am unsure whether the codec should be reset on .encode(text, final=True): 
UTF-8-SIG and UTF-16 do not reset the codec when final=True. io.TextIOWrapper 
only calls encoder.reset() on file.seek(0). On a seek to any other position, it 
calls encoder.setstate(0).
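
The UTF-8-SIG behaviour mentioned above can be observed with the standard
codecs module. The sketch below only illustrates that codec, where the state is
simply "has the BOM been written yet"; it says nothing about what the CJK
codecs should do.

```python
import codecs

encoder = codecs.getincrementalencoder('utf-8-sig')()

# The BOM is written together with the first chunk.
first = encoder.encode('a', final=True)
# final=True did not reset the encoder, so no second BOM appears.
second = encoder.encode('b', final=True)

# reset() returns the encoder to its initial state: the BOM is written again.
encoder.reset()
after_reset = encoder.encode('c')

# setstate(0) marks the BOM as already written, as after a seek past offset 0.
encoder.reset()
encoder.setstate(0)
after_setstate = encoder.encode('d')

print(first, second, after_reset, after_setstate)
```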

See also issues #12016 and #12057.

----------
components: Interpreter Core
files: cjk_no_reset.patch
keywords: patch
messages: 136194
nosy: haypo, hyeshik.chang, lemburg
priority: normal
severity: normal
status: open
title: Incremental encoders of CJK codecs reset the codec at each call to 
encode()
versions: Python 3.3
Added file: http://bugs.python.org/file22017/cjk_no_reset.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12100>
_______________________________________