On Mon, Mar 07, 2016 at 11:31:10AM +0200, Maciej Fijalkowski wrote:

> I think you're misunderstanding what we're proposing.
>
> We're proposing utf8 representation completely hidden from the user,
> where everything behaves just like cpython unicode (the len() example
> you're showing is a narrow unicode build I presume?)

Yes, CPython narrow builds don't handle Unicode code points in the
supplementary planes well: len() wrongly returns 2 for any code point
that needs a surrogate pair (4 bytes) in UTF-16:

steve@runes:~$ python2.6 -c "print len(u'\U0010FFFF')"  # wide build
1
steve@runes:~$ python2.7 -c "print len(u'\U0010FFFF')"  # narrow build
2

That is no longer the case since Python 3.3, when the "flexible string
representation" was introduced.

https://www.python.org/dev/peps/pep-0393/

I think that it would be a very valuable experiment for PyPy to
investigate moving to a UTF-8 internal representation.

-- 
Steve
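
For anyone without a narrow build to hand, the discrepancy can be
approximated on Python 3.3 or later by counting UTF-16 code units
directly. This is only a rough sketch: the helper names
utf16_code_units and utf8_bytes are made up for illustration, and it
says nothing about how PyPy or CPython actually store strings
internally.

```python
def utf16_code_units(s):
    """Number of 16-bit code units needed to encode s in UTF-16.

    This is roughly what len() reported on a narrow build.
    The explicit little-endian codec avoids counting a BOM.
    """
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s):
    """Number of bytes needed to encode s in UTF-8."""
    return len(s.encode("utf-8"))


ch = "\U0010FFFF"  # highest code point, in a supplementary plane

print(len(ch))               # code points, as Python 3.3+ counts them: 1
print(utf16_code_units(ch))  # UTF-16 code units (a surrogate pair): 2
print(utf8_bytes(ch))        # UTF-8 bytes: 4
```

The point of a hidden UTF-8 representation would be that the user only
ever sees the first number, whatever the byte count underneath.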