New issue 2389: Different behavior of bytes.decode('utf8', 'custom_replace')
https://bitbucket.org/pypy/pypy/issues/2389/different-behavior-of-bytesdecode-utf8

Konstantin Lopuhin:

The following program:

```
import codecs

# Custom handler: emit U+FFFD and resume decoding one byte after the error start.
codecs.register_error('custom_replace', lambda exc: (u'\ufffd', exc.start+1))

# Truncated multi-byte sequence at the end of the input.
s1 = b"WORD\xe3\xab"
print(repr(s1.decode('utf8', 'custom_replace')))
print(repr(s1.decode('utf8', 'replace')))

# Same invalid sequence, preceded by a BOM and followed by valid text.
s2 = b"\xef\xbb\xbfWORD\xe3\xabWORD2"
print(repr(s2.decode('utf8', 'custom_replace')))
print(repr(s2.decode('utf8', 'replace')))
```

produces different results on CPython 2.7 (I tried 2.7.6 and 2.7.12) and on PyPy 5.4.0:

```
$ pypy test.py
u'WORD\ufffd'
u'WORD\ufffd'
u'\ufeffWORD\ufffd\ufffdWORD2'
u'\ufeffWORD\ufffdWORD2'
$ python test.py
u'WORD\ufffd\ufffd'
u'WORD\ufffd'
u'\ufeffWORD\ufffd\ufffdWORD2'
u'\ufeffWORD\ufffdWORD2'
```

I think CPython is more consistent here: with a custom replace function, it replaces each invalid byte with the given symbol. PyPy does the same for s2, but for s1, where the invalid sequence is at the end of the input, it emits only one replacement symbol for the two invalid bytes.

The context: this code is used in w3lib (https://github.com/scrapy/w3lib/blob/v1.14.2/w3lib/encoding.py#L176; the CPython bug reference there might be slightly misleading) to emulate browser behavior for invalid utf8 handling, and CPython with custom_replace agrees with browser behavior here.
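
The difference comes down to where decoding resumes after an error, which is the second element of the tuple the handler returns. A minimal sketch for probing this on a given interpreter follows. The handler names `replace_per_byte` and `replace_whole_sequence` and the helper `compare` are hypothetical, introduced only for illustration; `replace_per_byte` uses the same logic as `custom_replace` above. The number of replacement characters produced by `replace_per_byte` for the truncated input may differ between interpreters, which is the subject of this report.

```python
import codecs

REPLACEMENT = u"\ufffd"


def replace_per_byte(exc):
    # Hypothetical handler, same logic as 'custom_replace' in the report:
    # emit one U+FFFD and resume one byte past the start of the error,
    # so the remaining bytes of the bad sequence are decoded again.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return (REPLACEMENT, exc.start + 1)


def replace_whole_sequence(exc):
    # Hypothetical handler: emit one U+FFFD and resume at exc.end,
    # skipping the whole byte range the decoder reported as invalid.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    return (REPLACEMENT, exc.end)


codecs.register_error("replace_per_byte", replace_per_byte)
codecs.register_error("replace_whole_sequence", replace_whole_sequence)


def compare(data):
    # Decode the same bytes with the built-in 'replace' handler and with
    # both custom handlers, returning the results keyed by handler name.
    names = ("replace", "replace_per_byte", "replace_whole_sequence")
    return {name: data.decode("utf8", name) for name in names}


if __name__ == "__main__":
    for sample in (b"WORD\xe3\xab", b"\xef\xbb\xbfWORD\xe3\xabWORD2"):
        for name, text in sorted(compare(sample).items()):
            print(name, repr(text))
```

Running this under both interpreters and comparing the `replace_per_byte` lines should show whether the decoder calls the handler again for the byte left over after `exc.start + 1`.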