Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

Steven D'Aprano Mon, 07 Mar 2016 03:47:04 -0800

On Mon, Mar 07, 2016 at 11:31:10AM +0200, Maciej Fijalkowski wrote:
> I think you're misunderstanding what we're proposing.
> 
> We're proposing utf8 representation completely hidden from the user,
> where everything behaves just like cpython unicode (the len() example
> you're showing is a narrow unicode build I presume?)


Yes, CPython narrow builds don't handle Unicode code points in the 
supplementary planes well: len() wrongly returns 2 for a single code 
point that needs a surrogate pair (4 bytes) in UTF-16:

steve@runes:~$ python2.6 -c "print len(u'\U0010FFFF')"  # wide build
1
steve@runes:~$ python2.7 -c "print len(u'\U0010FFFF')"  # narrow build
2


That has not been the case since Python 3.3, which dropped the 
narrow/wide build distinction when the "flexible string 
representation" was introduced.

https://www.python.org/dev/peps/pep-0393/
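
To make the difference concrete, here is a small sketch for Python 3.3 or 
later. It counts the same string three ways: as code points, as UTF-16 
code units (which is what len() reported on a narrow build), and as UTF-8 
bytes. The helper names are my own, purely for illustration; they are not 
part of CPython or PyPy.

```python
def utf16_code_units(s):
    # Each UTF-16 code unit is 2 bytes. An explicit byte order ("-le")
    # is used so that no BOM is prepended to the encoded result.
    return len(s.encode("utf-16-le")) // 2


def utf8_bytes(s):
    # Number of bytes the string occupies when encoded as UTF-8.
    return len(s.encode("utf-8"))


ch = "\U0010FFFF"            # highest code point, in a supplementary plane
print(len(ch))               # code points (Python 3.3+)
print(utf16_code_units(ch))  # UTF-16 code units: a surrogate pair
print(utf8_bytes(ch))        # UTF-8 bytes
```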

I think that it would be a very valuable experiment for PyPy to 
investigate moving to a UTF-8 internal representation.
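
As a rough idea of what "hidden from the user" implies, the sketch below 
computes len() and the byte offset of the N-th code point directly over 
UTF-8 bytes. This is only my illustration, assuming valid UTF-8 input and 
non-negative indexes; the function names are hypothetical and this is not 
how PyPy does or would do it. A real implementation would presumably want 
something better than a linear scan for indexing.

```python
def utf8_len(data):
    # UTF-8 continuation bytes look like 0b10xxxxxx. Every other byte
    # starts a new code point, so counting those gives the length.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_byte_offset(data, index):
    # Return the byte position where code point number `index` starts.
    # Linear scan: O(n) in the number of bytes.
    count = 0
    for pos, b in enumerate(data):
        if (b & 0xC0) != 0x80:
            if count == index:
                return pos
            count += 1
    raise IndexError("code point index out of range")


data = "a\U0010FFFF\u00e9".encode("utf-8")  # 1 + 4 + 2 bytes
print(utf8_len(data))             # user-visible length in code points
print(utf8_byte_offset(data, 2))  # where the third code point starts
```

The point is that the user-visible length stays in code points even 
though the storage is bytes, so the narrow-build problem above does not 
come back.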



-- 
Steve