New issue 2389: Different behavior of bytes.decode('utf8', 'custom_replace')
https://bitbucket.org/pypy/pypy/issues/2389/different-behavior-of-bytesdecode-utf8

Konstantin Lopuhin:

The following program:
```
import codecs

codecs.register_error('custom_replace', lambda exc: (u'\ufffd', exc.start+1))

s1 = b"WORD\xe3\xab"
print(repr(s1.decode('utf8', 'custom_replace')))
print(repr(s1.decode('utf8', 'replace')))

s2 = b"\xef\xbb\xbfWORD\xe3\xabWORD2"
print(repr(s2.decode('utf8', 'custom_replace')))
print(repr(s2.decode('utf8', 'replace')))
```
produces different results on CPython 2.7 (I tried 2.7.6 and 2.7.12) and on 
PyPy 5.4.0:

```
$ pypy test.py 
u'WORD\ufffd'
u'WORD\ufffd'
u'\ufeffWORD\ufffd\ufffdWORD2'
u'\ufeffWORD\ufffdWORD2'
$ python test.py 
u'WORD\ufffd\ufffd'
u'WORD\ufffd'
u'\ufeffWORD\ufffd\ufffdWORD2'
u'\ufeffWORD\ufffdWORD2'
```

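I think CPython is more consistent here: with a custom error handler that 
resumes at exc.start+1, it replaces each invalid byte with the given symbol. 
PyPy does the same for s2, but for s1, where the truncated sequence \xe3\xab 
ends the input, it emits only one replacement character.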
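To narrow down where the two interpreters diverge, it may help to log what the decoder reports to the handler on each call. The sketch below is only a diagnostic aid: the handler name `logging_replace` and the `calls` list are hypothetical and are not part of w3lib or of the original test script. It keeps the same policy as `custom_replace` (emit U+FFFD, resume one byte after the error start).

```python
import codecs

calls = []  # (start, end, reason) for every error the decoder reports


def logging_replace(exc):
    # Hypothetical diagnostic handler (an assumption for illustration).
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    calls.append((exc.start, exc.end, exc.reason))
    # Same policy as 'custom_replace': one U+FFFD, resume at start + 1.
    return (u'\ufffd', exc.start + 1)


codecs.register_error('logging_replace', logging_replace)

result = b"WORD\xe3\xab".decode('utf8', 'logging_replace')
print(repr(result))
for start, end, reason in calls:
    print("start=%d end=%d reason=%s" % (start, end, reason))
```

Running this under both interpreters and comparing the logged ranges should show whether the handler is called once or twice for the trailing bytes, and whether the position it returns is honored.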

Context: w3lib uses this code at 
https://github.com/scrapy/w3lib/blob/v1.14.2/w3lib/encoding.py#L176 (the 
CPython bug reference there might be slightly misleading) to emulate how 
browsers handle invalid UTF-8, and CPython with custom_replace agrees with 
browser behavior.



_______________________________________________
pypy-issue mailing list
pypy-issue@python.org
https://mail.python.org/mailman/listinfo/pypy-issue
