New issue 2618: incorrect "surrogatepass" encoding with pypy3.5-5.8.0 https://bitbucket.org/pypy/pypy/issues/2618/incorrect-surrogatepass-encoding-with
Cosimo Lupo:

Hello,

I'm getting different encodings between CPython 3.5.3 and pypy3.5-5.8.0 when the input string contains surrogate code points.

When I round-trip the string 'Carrot \ud83e\udd55' through the "utf_16_be" encoding with errors="surrogatepass", CPython correctly gives me 'Carrot \U0001f955':

```
Python 3.5.3 (default, Jul 18 2017, 13:04:39)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 'Carrot \ud83e\udd55'.encode('utf_16_be', errors='surrogatepass')
b'\x00C\x00a\x00r\x00r\x00o\x00t\x00 \xd8>\xddU'
>>> 'Carrot \ud83e\udd55'.encode('utf_16_be', errors='surrogatepass').decode('utf_16_be')
'Carrot \U0001f955'
```

However, with PyPy3.5 5.8.0 and the same input and code, I get this:

```
Python 3.5.3 (a37ecfe5f142bc971a86d17305cc5d1d70abec64, Jul 25 2017, 16:48:07)
[PyPy 5.8.0-beta0 with GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
And now for something completely different: ``the future has just begun''
>>>> 'Carrot \ud83e\udd55'.encode('utf_16_be', errors='surrogatepass')
b'\x00C\x00a\x00r\x00r\x00o\x00t\x00 >\xd8U\xdd'
>>>> 'Carrot \ud83e\udd55'.encode('utf_16_be', errors='surrogatepass').decode('utf_16_be')
'Carrot 㻘嗝'
```

The ASCII characters are encoded identically in both cases; only the last four bytes, which encode the two surrogates, differ.

I'm on macOS 10.12.6 and compiled pypy3 from source, using the latest GCC 7.1.0 from Homebrew. I haven't had the chance to try on Linux yet.

Thanks for your help.
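A small script may make the difference easier to check on other platforms. It is only a sketch based on the two outputs above: `pack_utf16_be` is a hypothetical reference helper (not part of either interpreter) that packs each code point as one big-endian 16-bit unit, and the byte-swapped tail is an interpretation of the bytes in the PyPy output, not a confirmed diagnosis of the bug.

```python
import struct

# The string from the report: two surrogate code points that form a pair.
text = 'Carrot \ud83e\udd55'


def pack_utf16_be(s):
    """Hypothetical reference helper: one big-endian 16-bit unit per code point.

    Only valid when every code point fits in 16 bits, as is the case here.
    """
    return b''.join(struct.pack('>H', ord(ch)) for ch in s)


encoded = text.encode('utf_16_be', errors='surrogatepass')
reference = pack_utf16_be(text)

# On an implementation without the reported problem, both byte strings agree.
matches_reference = encoded == reference

# Packing the two surrogates little-endian reproduces the last four bytes
# shown in the PyPy output (assumption: the surrogates were byte-swapped).
swapped_tail = struct.pack('<HH', 0xD83E, 0xDD55)

# Reading those swapped bytes back as big-endian yields two unrelated
# BMP characters instead of one astral code point.
mojibake = swapped_tail.decode('utf_16_be')

print(matches_reference)
print(swapped_tail)
print(ascii(mojibake))
```

If `matches_reference` is false on a given interpreter, comparing `encoded[-4:]` with `swapped_tail` shows whether it is hitting the same byte-order difference described here.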