Thanks for the link! It is interesting that in Python 3.5, still:

>>> len(u'\ud805\udc09')
2
>>> u'\ud805\udc09' == u'\U00011409'
False

I think that in Python 3.x, u'\ud805\udc09' is not another spelling of u'\U00011409'; it is just an ill-formed Unicode string. It also raises UnicodeEncodeError if you try to encode it to UTF-8. The problem is that it is legal to define and use these strings. If PyPy uses UTF-8 or UTF-16 as the internal storage format, I don't think it is possible to keep these details the same as in CPython, but that should be acceptable.

Thanks again for the discussion. Unicode is really complicated.

2016-03-07
hubo

From: Steven D'Aprano <st...@pearwood.info>
Sent: 2016-03-07 19:45
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "pypy-dev" <pypy-dev@python.org>
Cc:

On Mon, Mar 07, 2016 at 11:31:10AM +0200, Maciej Fijalkowski wrote:

> I think you're misunderstanding what we're proposing.
>
> We're proposing utf8 representation completely hidden from the user,
> where everything behaves just like cpython unicode (the len() example
> you're showing is a narrow unicode build I presume?)

Yes, CPython narrow builds don't handle Unicode code points in the supplementary planes well: they wrongly report a length of 2 for code points with a 4-byte UTF-16 representation:

steve@runes:~$ python2.6 -c "print len(u'\U0010FFFF')"  # wide build
1
steve@runes:~$ python2.7 -c "print len(u'\U0010FFFF')"  # narrow build
2

That is no longer the case since Python 3.3, when the "flexible string representation" was introduced.

https://www.python.org/dev/peps/pep-0393/

I think that it would be a very valuable experiment for PyPy to investigate moving to a UTF-8 internal representation.

--
Steve
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev
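For anyone who wants to reproduce the surrogate behaviour discussed in this message, here is a minimal sketch for CPython 3.4 or later. The helper names utf8_encodable and combine_surrogates are made up for illustration and are not part of CPython or PyPy; the 'surrogatepass' error handler is a standard codec option. It only demonstrates what CPython does today and says nothing about how PyPy would store strings internally.

```python
# Two lone surrogate code points, exactly as written in the message.
pair = '\ud805\udc09'
# The supplementary-plane code point that this pair would encode in UTF-16.
astral = '\U00011409'


def utf8_encodable(s):
    """Return True if s can be encoded with the strict UTF-8 codec."""
    try:
        s.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True


def combine_surrogates(s):
    """Hypothetical helper: write the surrogates out as raw UTF-16 code
    units, then decode those bytes so that valid pairs are joined."""
    return s.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')


print(len(pair), len(astral))                        # two code points vs. one
print(pair == astral)                                # compared code point by code point
print(utf8_encodable(pair), utf8_encodable(astral))  # strict UTF-8 rejects surrogates
print(combine_surrogates(pair) == astral)            # the pair maps to U+11409
```

The round trip through UTF-16 is the only step that treats the two surrogates as a pair; everywhere else CPython 3 treats them as two separate code points that happen to have no UTF-8 encoding.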