I do not think it is reasonable to use UTF-8 as the internal representation of 
the unicode string type.


1. Less storage - this is not always true. It only holds for strings that are 
mostly ASCII. In Asia, most strings in the local languages (Japanese, Chinese, 
Korean) consist of non-ASCII characters, so they may consume more storage in 
UTF-8 than in UTF-16. To make things worse, while an N-character string always 
consumes 2*N bytes in UTF-16, it is difficult to estimate the size of an 
N-character string in UTF-8 (it may be anywhere from N to 3*N bytes), as the 
example below shows.
(UTF-16 also has two-word characters, but len() reports 2 for these characters; 
I think it is not harmful to treat them as two characters.)
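
For example, with a short Japanese string (sizes are in bytes):

    s = u"こんにちは世界"                  # 7 characters, all non-ASCII
    print(len(s.encode("utf-8")))      # 21 -- 3 bytes per character
    print(len(s.encode("utf-16-le")))  # 14 -- 2 bytes per character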

2. There would be very complicated logic for size calculation and slicing. In 
UTF-16 every character is represented by a 16-bit integer, so size calculation 
and slicing are convenient. But a character in UTF-8 occupies a variable number 
of bytes, so we would either have to call the mb_* string functions instead 
(which are slow by nature) or use special logic such as storing the byte 
offsets of the characters in a side array (which costs extra memory accesses); 
see the sketch below.
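
A rough sketch of the side-array approach (the helper names are only 
illustrative, not an actual implementation):

    # Record the byte offset of every character once, so that slicing the
    # UTF-8 buffer by character position becomes two table lookups.
    def build_offsets(utf8_bytes):
        offsets = []
        for i, b in enumerate(bytearray(utf8_bytes)):
            if b & 0xC0 != 0x80:            # not a continuation byte
                offsets.append(i)           # start of a character
        offsets.append(len(utf8_bytes))     # sentinel: end of buffer
        return offsets

    def char_slice(utf8_bytes, offsets, start, stop):
        return utf8_bytes[offsets[start]:offsets[stop]]

    data = u"a中b文c".encode("utf-8")
    offsets = build_offsets(data)
    print(char_slice(data, offsets, 1, 3).decode("utf-8"))   # -> 中b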
 
3. When displayed with repr(), non-ASCII characters are shown in \uXXXX 
format. If the internal storage for unicode is UTF-8, the only way to stay 
compatible with this format is to convert the string back to UTF-16.
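
For example, in CPython 2:

    >>> u"中文"                       # repr() escapes non-ASCII code points
    u'\u4e2d\u6587'
    >>> u"中文".encode("utf-8")       # the raw UTF-8 bytes look quite different
    '\xe4\xb8\xad\xe6\x96\x87'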

It may be wiser to let programmers decide which encoding they would like to 
use. If they want to process UTF-8 strings without the performance cost of 
conversion, they should use "bytes". When correct size calculation and slicing 
of non-ASCII characters are a concern, it may be better to use "unicode".

2016-03-07 

hubo 



From: Armin Rigo <ar...@tunes.org>
Sent: 2016-03-05 16:09
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "Piotr Jurkiewicz" <piotr.jerzy.jurkiew...@gmail.com>
Cc: "PyPy Developer Mailing List" <pypy-dev@python.org>

Hi Piotr, 

Thanks for giving some serious thought to the utf8-stored unicode 
string proposal! 

On 5 March 2016 at 01:48, Piotr Jurkiewicz 
<piotr.jerzy.jurkiew...@gmail.com> wrote: 
>     Random access would be as follows: 
> 
>         page_num, byte_in_page = divmod(codepoint_pos, 64) 
>         page_start_byte = index[page_num] 
>         exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) 
>         return buffer[exact_byte] 

This is the part I'm least sure about: seek_forward() needs to be a 
loop over 0 to 63 codepoints.  True, each loop can be branchless, and 
very short---let's say 4 instructions.  But it still makes a total of 
up to 252 instructions (plus the checks to know if we must go on). 
These instructions are all or almost all dependent on the previous 
one: you must have finished computing the length of one sequence to 
even begin computing the length of the next one.  Maybe it's faster to 
use a more "XMM-izable" algorithm which counts 0 for each byte in 
0x80-0xBF and 1 otherwise, and makes the sum. 
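
In rough pure-Python pseudo-code (illustrative only, not actual PyPy code; 
the loop is sequential here, but in C the per-byte test is branchless and 
SIMD-friendly):

    # seek_forward(buf, n): byte offset of the n-th code point in 'buf'.
    # Each byte counts as 1 if it is a leading byte (not in 0x80-0xBF)
    # and as 0 if it is a continuation byte.
    def seek_forward(buf, n):
        count = 0
        for i, b in enumerate(bytearray(buf)):
            if b & 0xC0 != 0x80:             # leading byte of a code point
                if count == n:
                    return i
                count += 1
        raise IndexError("page has fewer than n code points")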

There are also variants, e.g. adding a second array of words similar 
to 'index', but where each word is 8 packed bytes giving 8 starting 
points inside the page (each in range 0-252).  This would reduce the 
walk to 0-7 codepoints. 
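
Such a second-level table could look roughly like this (again only an 
illustrative sketch):

    # Per 64-codepoint page: 8 packed byte offsets, one for every 8th
    # code point (0, 8, 16, ..., 56), each relative to the page start.
    def build_subindex(page_bytes):
        subs = []
        count = 0
        for i, b in enumerate(bytearray(page_bytes)):
            if b & 0xC0 != 0x80:             # leading byte of a code point
                if count % 8 == 0:
                    subs.append(i)           # fits in one byte (0..252)
                count += 1
        return subs

    def byte_of_codepoint(page_bytes, subs, k):   # k in range(64)
        buf = bytearray(page_bytes)
        i = subs[k // 8]                     # jump near the target
        for _ in range(k % 8):               # walk over at most 7 code points
            i += 1
            while buf[i] & 0xC0 == 0x80:     # skip continuation bytes
                i += 1
        return i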

I'm +1 on your proposal. The whole thing is definitely worth a try. 


A bientôt, 

Armin. 
_______________________________________________ 
pypy-dev mailing list 
pypy-dev@python.org 
https://mail.python.org/mailman/listinfo/pypy-dev 