Hi all,

I just wanted to mention that several other language implementors have faced the same problem of dealing with "UTF-16" that contains lone surrogate code points and representing it in "UTF-8", and they have arrived at essentially the same solution. Users include the Racket, Scheme 48, and Rust languages (all three only for I/O on Windows) and the Servo browser engine (for the sake of JavaScript). Recently, Simon Sapin of Mozilla spec'd this trick in exhaustive detail, christening it WTF-8:

  https://simonsapin.github.io/wtf-8/

While everything described there may be fairly obvious to those immersed in the guts of Unicode, I wanted to raise awareness that the technique has a name and other users.
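To make the trick concrete, here is a minimal sketch of the generalized encoder (my own illustration in Python 3 syntax, not code from the spec): it follows the ordinary UTF-8 bit layout but skips the strict-UTF-8 check that rejects surrogates, so a lone surrogate such as U+D805 encodes to the three bytes ED A0 85. This is essentially what CPython 3's 'surrogatepass' error handler does.

    def wtf8_encode_code_point(cp):
        # Plain UTF-8 bit layout, minus the strict-UTF-8 rule that
        # rejects surrogate code points (U+D800..U+DFFF).
        if cp < 0x80:
            return bytes([cp])
        elif cp < 0x800:
            return bytes([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            # This branch also accepts lone surrogates -- the whole point.
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])

    >>> wtf8_encode_code_point(0xD805)
    b'\xed\xa0\x85'
    >>> '\ud805'.encode('utf-8', 'surrogatepass')   # same result in CPython 3
    b'\xed\xa0\x85'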
Cheers,
Robin

On 8 March 2016 at 16:16, Armin Rigo <ar...@tunes.org> wrote:
> Hi hubo,
>
> On 7 March 2016 at 13:49, hubo <h...@jiedaibao.com> wrote:
> > I think in Python 3.x, u'\ud805\udc09' is not another form of
> > u'\U00011409'; it is just an illegal unicode string. It also raises
> > UnicodeEncodeError if you try to encode it into UTF-8. The problem
> > is that it is legal to define and use these strings. If PyPy uses
> > UTF-8 or UTF-16 as the internal storage format, I don't think it is
> > possible to keep these details the same as CPython, but it should
> > be acceptable.
>
> We're good at keeping obscure details the same as CPython. It's only
> a matter of adding the correct checks on top of the encode() and
> decode() methods, independently of the underlying representation.
>
> In this case, because we can construct the length-1 unicode string
> u'\ud805', we have to represent it internally somehow, and the
> natural way is to represent it as the 3 bytes '\xed\xa0\x85'. So for
> u'\ud805\udc09' we use 6 bytes. Strictly speaking, we are thus not
> using utf-8 internally, but "utf-8-without-extra-consistency-checks".
> In Python 2, u'\ud805\udc09'.encode('utf-8') returns
> '\xf0\x91\x90\x89', i.e. a single code point encoded as 4 bytes.
> This means that ``encode('utf-8')`` has to check for surrogates and
> do something more complicated on Python 2.x (or complain on Python
> 3.x). In other words, neither ``decode('utf-8')`` nor
> ``encode('utf-8')`` can be a no-op. Decoding and encoding need to
> check the data, and might actually need to make a copy in corner
> cases, but not in the vast majority of cases.
>
> This is all focused on the web's (and generally Linux's) approach of
> "utf-8 everywhere". For Windows, the story is more complicated.
> CPython 2.x uses UTF-16, like the Windows API. However, recent
> CPython 3.x moved instead to a variable-width representation whose
> widest form is UCS-4 (== UTF-32). If you are on a recent CPython
> 3.x, build a unicode object with a large code point, and then call
> the Windows API with it, CPython needs to convert it to UTF-16
> dynamically anyway, as far as I can tell, i.e. convert from UCS-4 to
> UTF-16. In the proposal discussed here, it would instead convert
> from utf-8-without-extra-consistency-checks to UTF-16 in that
> situation.
>
> There are definitely trade-offs to explore, but I doubt we can fully
> explore them without actually trying it out.
>
>
> A bientôt,
>
> Armin.
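P.S. For the archives: the '\xf0\x91\x90\x89' in Armin's Python 2 example above is plain surrogate-pair arithmetic. Here is a rough sketch (my illustration, not actual PyPy code) of the join that such an encode() check would perform:

    def join_surrogate_pair(high, low):
        # Combine a high (lead) and low (trail) surrogate into the
        # supplementary code point they stand for in UTF-16.
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    >>> hex(join_surrogate_pair(0xD805, 0xDC09))
    '0x11409'
    >>> chr(0x11409).encode('utf-8')   # the 4-byte sequence Armin mentions
    b'\xf0\x91\x90\x89'

Scanning for such pairs is exactly the extra pass that keeps encode('utf-8') from being a no-op over the internal bytes.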
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev