New submission from STINNER Victor <victor.stin...@haypocalc.com>:

The incremental encoders of stateful CJK codecs reset the codec at each call to encode(), producing a valid but overlong output:
>>> import codecs
>>> encoder = codecs.getincrementalencoder('hz')()
>>> encoder.encode('\u804a') + encoder.encode('\u804a')
b'~{AD~}~{AD~}'
>>> '\u804a\u804a'.encode('hz')
b'~{ADAD~}'

Affected multibyte encodings: HZ and all encodings of the ISO 2022 family (e.g. iso-2022-jp).

The attached patch fixes this issue. I don't like how I added the tests: they could be moved somewhere else, but the HZ codec doesn't have tests today (I opened issue #12057 for that), and the ISO 2022 codecs don't have specific tests (test_multibytecodec is "Unit test for multibytecodec itself"). Maybe we should add tests specific to ISO 2022 first?

I am unsure whether to reset the codec on .encode(text, final=True); UTF-8-SIG and UTF-16 don't reset the codec if final=True. io.TextIOWrapper only calls encoder.reset() on file.seek(0). On a seek to another position, it calls encoder.setstate(0).

See also issues #12016 and #12057.

----------
components: Interpreter Core
files: cjk_no_reset.patch
keywords: patch
messages: 136194
nosy: haypo, hyeshik.chang, lemburg
priority: normal
severity: normal
status: open
title: Incremental encoders of CJK codecs reset the codec at each call to encode()
versions: Python 3.3

Added file: http://bugs.python.org/file22017/cjk_no_reset.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12100>
_______________________________________
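
To compare the incremental and one-shot outputs described in this report, a minimal sketch follows. It assumes a Python 3 interpreter with the hz codec available. The helper name encode_in_chunks is hypothetical and is not part of the attached patch. The bytes printed for the incremental case depend on whether the interpreter includes a fix for this issue, so no expected output is given for it.

```python
import codecs


def encode_in_chunks(chunks, encoding="hz"):
    """Encode an iterable of str chunks with one incremental encoder.

    Hypothetical helper for illustration only.
    """
    encoder = codecs.getincrementalencoder(encoding)()
    parts = [encoder.encode(chunk) for chunk in chunks]
    # final=True signals that no more input follows, so the encoder may
    # emit whatever bytes are needed to terminate the stream.
    parts.append(encoder.encode("", final=True))
    return b"".join(parts)


text = "\u804a\u804a"
incremental = encode_in_chunks(text)  # iterating a str yields one character per call
one_shot = text.encode("hz")

print(incremental)
print(one_shot)
# Both forms should decode back to the original text, even when the
# incremental output is longer than the one-shot output.
print(incremental.decode("hz") == text)
```

If the incremental output is longer than the one-shot output, the encoder is resetting its state between calls, as described above.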