Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

hubo Mon, 07 Mar 2016 04:50:08 -0800

Thanks for the link!

It is interesting that this is still the case in Python 3.5:

>>> len(u'\ud805\udc09')
2
>>> u'\ud805\udc09' == u'\U00011409'
False 

I think that in Python 3.x, u'\ud805\udc09' is not another representation of 
u'\U00011409'; it is simply an ill-formed Unicode string made of surrogate 
code points. Encoding it to UTF-8 raises UnicodeEncodeError. The problem is 
that it is still legal to define and use such strings. If PyPy uses UTF-8 or 
UTF-16 as the internal storage format, I don't think it can match CPython in 
these details, but that should be acceptable.
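
A minimal sketch of the behaviour described above, assuming CPython 3.4 or 
later (the 'surrogatepass' handler is only assumed to work with the UTF-16 
codecs from 3.4 on; the variable names are purely illustrative):

```python
# Two surrogate code points versus the single code point that the same
# pair of 16-bit units would denote in UTF-16.
pair = '\ud805\udc09'
single = '\U00011409'

# CPython keeps the surrogates as two separate code points.
pair_len = len(pair)
single_len = len(single)
are_equal = (pair == single)

# Strict UTF-8 encoding refuses surrogate code points.
try:
    pair.encode('utf-8')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# With 'surrogatepass' the surrogates are written out as raw UTF-16 units;
# decoding those units then yields the single supplementary code point.
roundtrip = pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
```

The last line shows why a UTF-16 based storage could not tell the two strings 
apart without extra bookkeeping: both end up as the same four bytes.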

Thanks again for the discussion. Unicode is really complicated.

2016-03-07 

hubo 

From: Steven D'Aprano <[email protected]>
Sent: 2016-03-07 19:45
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "pypy-dev"<[email protected]>
Cc:

On Mon, Mar 07, 2016 at 11:31:10AM +0200, Maciej Fijalkowski wrote: 
> I think you're misunderstanding what we're proposing. 
>  
> We're proposing utf8 representation completely hidden from the user, 
> where everything behaves just like cpython unicode (the len() example 
> you're showing is a narrow unicode build I presume?) 

Yes, CPython narrow builds don't handle Unicode code points in the 
supplementary planes well: len() wrongly returns 2 for code points 
that need a surrogate pair (4 bytes) in UTF-16: 

steve@runes:~$ python2.6 -c "print len(u'\U0010FFFF')"  # wide build 
1 
steve@runes:~$ python2.7 -c "print len(u'\U0010FFFF')"  # narrow build 
2 

That is no longer the case since Python 3.3, when the "flexible  
string representation" was introduced. 

https://www.python.org/dev/peps/pep-0393/ 
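
As a rough check of the post-3.3 behaviour, assuming a CPython 3.3 or later 
interpreter (the variable names are illustrative only):

```python
# The highest Unicode code point, which lies in a supplementary plane.
ch = '\U0010FFFF'

# Since the flexible string representation, len() counts code points.
n_code_points = len(ch)

# The encoded size still depends on the encoding: a surrogate pair in
# UTF-16 and a four-byte sequence in UTF-8.
n_utf16_bytes = len(ch.encode('utf-16-le'))
n_utf8_bytes = len(ch.encode('utf-8'))
```

The string length no longer depends on how the interpreter was built, while 
the byte counts belong to the chosen encoding rather than to the string.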

I think that it would be a very valuable experiment for PyPy to  
investigate moving to a UTF-8 internal representation. 
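
To illustrate what hiding a UTF-8 representation from the user involves, here 
is a small sketch. It is not PyPy's implementation; count_code_points is a 
hypothetical helper and it assumes its input is well-formed UTF-8:

```python
def count_code_points(utf8_bytes):
    """Count code points in well-formed UTF-8 (hypothetical helper)."""
    # Continuation bytes have the bit pattern 10xxxxxx; every other byte
    # starts a new code point.
    return sum(1 for b in utf8_bytes if (b & 0xC0) != 0x80)

# One ASCII character, one two-byte character, one four-byte character.
text = 'a\u00e9\U0010FFFF'
stored = text.encode('utf-8')

n_bytes = len(stored)                 # size of the internal storage
n_chars = count_code_points(stored)   # what len() would have to report
```

The point is only that len() and indexing have to be answered in code points 
even though the storage is measured in bytes.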

--  
Steve 
_______________________________________________ 
pypy-dev mailing list 
[email protected] 
https://mail.python.org/mailman/listinfo/pypy-dev
