Hi hubo,

On 7 March 2016 at 13:49, hubo <h...@jiedaibao.com> wrote:
> I think in Python 3.x, u'\ud805\udc09' is not another format of
> u'\U00011409', it is just an illegal unicode string. It also raises
> UnicodeEncodeError if you try to encode it into UTF-8. The problem is that
> it is legal to define and use these strings. If PyPy uses UTF-8 or UTF-16 as
> the internal storage format, I don't think it is possible to keep these
> details same as CPython, but it should be acceptable.

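For reference, the behaviour described in the quoted text can be checked on a CPython 3.3+ interpreter with a small sketch like the following. The helper name ``describe`` is made up for this illustration and is not part of any library:

```python
# Sketch for CPython 3.3+; `describe` is a hypothetical helper.
def describe(s):
    """Return (length, utf-8 bytes), or (length, None) if strict encoding fails."""
    try:
        return (len(s), s.encode('utf-8'))
    except UnicodeEncodeError:
        # The strict utf-8 codec refuses surrogate code points.
        return (len(s), None)

pair = u'\ud805\udc09'    # two surrogate code points, legal to define
single = u'\U00011409'    # one code point outside the BMP

print(pair == single)     # False: they are different strings on Python 3
print(describe(pair))     # (2, None)
print(describe(single))   # (1, b'\xf0\x91\x90\x89')
```
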
We're good at keeping obscure details the same as CPython. It's only a
matter of adding the correct checks on top of the encode() and decode()
methods, independently of the underlying representation.

In this case, because the length-1 unicode string u'\ud805' is legal, we
have to represent it internally somehow, and the natural way would be to
represent it as the 3 bytes '\xed\xa0\x85'. So for u'\ud805\udc09' we use
6 bytes. Strictly speaking, we're thus not using utf-8 internally, but
"utf-8-without-extra-consistency-checks".

In Python 2, u'\ud805\udc09'.encode('utf-8') returns '\xf0\x91\x90\x89',
i.e. a single code point of 4 bytes. This means that calling
``encode('utf-8')`` has to check for surrogates, and do something more
complicated on Python 2.x (or complain on Python 3.x). In other words,
neither ``decode('utf-8')`` nor ``encode('utf-8')`` can be no-ops.
Decoding and encoding need to check the data, and might actually need to
make a copy in corner cases, but not in the vast majority of cases.

This is all focused on the web and generally Linux approach of "utf-8
everywhere". For Windows, the story is more complicated. CPython 2.x uses
UTF-16, like the Windows API. However, recent CPython 3.x versions moved
anyway to a variable-width model that goes up to UCS-4 (== UTF-32). If you
are on a recent CPython 3.x and build a unicode object with a large
codepoint, and then call the Windows API with it, it will need anyway to
convert it to UTF-16 dynamically, as far as I can tell, i.e. convert from
UCS-4 to UTF-16. In the proposal that is discussed here, it would instead
have to convert from utf-8-without-extra-consistency-checks to UTF-16 in
that situation.

There are definitely trade-offs to explore, but I doubt that we can fully
explore these trade-offs without actually trying it out.

A bientôt,

Armin.
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev
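
P.S. The byte-level claims in the message above can be checked with a rough sketch on CPython 3.3+. It uses the standard 'surrogatepass' error handler as a stand-in for "utf-8-without-extra-consistency-checks". The names ``to_internal``, ``join_surrogates`` and ``encode_utf8_py2_style`` are invented for this illustration; this is not PyPy's actual implementation, only an assumption about what the extra check on top of encode() might look like:

```python
# Sketch only (CPython 3.3+); all helper names here are hypothetical.

def to_internal(s):
    """Model the relaxed storage format: every code point, including a
    lone surrogate, is written out with the ordinary utf-8 bit layout."""
    return s.encode('utf-8', 'surrogatepass')

def join_surrogates(s):
    """Merge each high+low surrogate pair into the code point it denotes."""
    out = []
    i = 0
    while i < len(s):
        hi = ord(s[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(s):
            lo = ord(s[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(s[i])   # ordinary character or unpaired surrogate
        i += 1
    return ''.join(out)

def encode_utf8_py2_style(s):
    """Approximate Python 2's encode('utf-8'): pairs are merged first,
    unpaired surrogates still pass through as 3 bytes."""
    return join_surrogates(s).encode('utf-8', 'surrogatepass')

print(to_internal(u'\ud805'))                    # b'\xed\xa0\x85'
print(len(to_internal(u'\ud805\udc09')))         # 6
print(encode_utf8_py2_style(u'\ud805\udc09'))    # b'\xf0\x91\x90\x89'
```

The point of the sketch is that the check lives in the encode step, so strings without surrogates take the cheap path and only the corner cases need a rewritten copy.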