I don't think it is reasonable to use UTF-8 as the internal representation of the unicode string type.
1. Less storage - this is not always true. It only holds for strings that are mostly ASCII. In Asia, most strings in the local languages (Japanese, Chinese, Korean) consist of non-ASCII characters, so they may take more storage in UTF-8 than in UTF-16. To make things worse, while an N-character string always takes 2*N bytes in UTF-16, the size of an N-character string in UTF-8 is hard to estimate (anywhere from N to 3*N bytes). (UTF-16 also has two-word characters, but len() reports 2 for them, and I think it is not harmful to treat them as two characters.) A byte-count comparison is sketched below, after the quoted message.

2. Size calculation and slicing would need very complicated logic. In UTF-16 every character is represented by a 16-bit integer, so size calculation and slicing are straightforward. A character in UTF-8 takes a variable number of bytes, so we either have to call mb_* string functions instead (which are slow by nature) or use special logic such as storing character indices in another array (which introduces the cost of extra address lookups). See the second sketch below.

3. When displayed with repr(), non-ASCII characters are shown in \uXXXX format. If the internal storage for unicode is UTF-8, the only way to stay compatible with this format is to convert back to UTF-16.

It may be wiser to let programmers decide which encoding they would like to use. If they want to process UTF-8 strings without paying the cost of conversion, they should use "bytes". When correct size calculation and slicing of non-ASCII characters matter, it may be better to use "unicode".

2016-03-07

hubo

From: Armin Rigo <ar...@tunes.org>
Sent: 2016-03-05 16:09
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "Piotr Jurkiewicz" <piotr.jerzy.jurkiew...@gmail.com>
Cc: "PyPy Developer Mailing List" <pypy-dev@python.org>

Hi Piotr,

Thanks for giving some serious thought to the utf8-stored unicode string proposal!

On 5 March 2016 at 01:48, Piotr Jurkiewicz <piotr.jerzy.jurkiew...@gmail.com> wrote:
> Random access would be as follows:
>
>     page_num, byte_in_page = divmod(codepoint_pos, 64)
>     page_start_byte = index[page_num]
>     exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
>     return buffer[exact_byte]

This is the part I'm least sure about: seek_forward() needs to be a loop over 0 to 63 codepoints. True, each iteration can be branchless, and very short---let's say 4 instructions. But it still makes a total of up to 252 instructions (plus the checks to know if we must go on). These instructions are all or almost all dependent on the previous one: you must have finished computing the length of one sequence to even begin computing the length of the next one.

Maybe it's faster to use a more "XMM-izable" algorithm which counts 0 for each byte in 0x80-0xBF and 1 otherwise, and makes the sum.

There are also variants, e.g. adding a second array of words similar to 'index', but where each word is 8 packed bytes giving 8 starting points inside the page (each in range 0-252). This would reduce the walk to 0-7 codepoints.

I'm +1 on your proposal. The whole thing is definitely worth a try.

A bientôt,

Armin.
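(Sketch for point 1 above, appended for illustration and not part of Armin's quoted message: a rough comparison of encoded sizes. The sample strings are arbitrary.)

    # Point 1: for CJK text, UTF-8 usually takes more bytes than UTF-16.
    samples = [
        u"hello world",                     # pure ASCII
        u"\u65e5\u672c\u8a9e",              # Japanese, 3 CJK codepoints
        u"\u4f60\u597d\uff0c\u4e16\u754c",  # Chinese, 5 CJK codepoints
    ]
    for s in samples:
        print("%d chars: %d bytes in UTF-8, %d bytes in UTF-16" % (
            len(s),
            len(s.encode("utf-8")),       # 1 to 3 bytes per codepoint here
            len(s.encode("utf-16-le")),   # 2 bytes per BMP codepoint
        ))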
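(Sketch for point 2 above: a minimal pure-Python illustration of why indexing into a UTF-8 buffer needs a scan while a fixed-width encoding is plain arithmetic. The function names are made up for the example.)

    def utf8_byte_offset(buf, codepoint_pos):
        # buf holds valid UTF-8; find the byte offset of the codepoint_pos-th
        # codepoint.  Bytes 0x80-0xBF are continuation bytes; any other byte
        # starts a codepoint.
        seen = 0
        for i, b in enumerate(bytearray(buf)):
            if not 0x80 <= b <= 0xBF:
                if seen == codepoint_pos:
                    return i
                seen += 1
        raise IndexError("codepoint index out of range")

    def utf16_byte_offset(codepoint_pos):
        # fixed-width case: plain arithmetic (surrogate pairs ignored,
        # as in point 1 above)
        return 2 * codepoint_pos

For example, utf8_byte_offset(u"a\u00e9b".encode("utf-8"), 2) has to walk over the two bytes of \u00e9 before it can return 3, whereas the fixed-width case is a single multiplication.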
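(A pure-Python sketch of the page-index scheme quoted above, together with the "count 1 for every non-continuation byte" idea, assuming one recorded byte offset per 64 codepoints. All names are mine, not from the proposal, and the real implementation would live inside the interpreter rather than in pure Python.)

    PAGE = 64  # codepoints per index entry, as in the quoted proposal

    def is_lead(b):
        # 0x80-0xBF are continuation bytes; every other byte starts a codepoint
        return not 0x80 <= b <= 0xBF

    def codepoint_length(buf):
        # count 1 for every non-continuation byte and sum
        return sum(1 for b in bytearray(buf) if is_lead(b))

    def build_index(buf):
        # record the byte offset of every PAGE-th codepoint
        index = []
        count = 0
        for i, b in enumerate(bytearray(buf)):
            if is_lead(b):
                if count % PAGE == 0:
                    index.append(i)
                count += 1
        return index

    def codepoint_to_byte(buf, index, codepoint_pos):
        # O(1) page lookup, then a forward walk of at most PAGE-1 codepoints
        # (assumes 0 <= codepoint_pos < codepoint_length(buf))
        page_num, pos_in_page = divmod(codepoint_pos, PAGE)
        data = bytearray(buf)
        i = index[page_num]
        while pos_in_page:
            i += 1
            if is_lead(data[i]):
                pos_in_page -= 1
        return i

    # usage: 300 codepoints mixing 1-, 2-, 3- and 4-byte characters
    buf = (u"abc\u00e9\u4e2d\U0001f600" * 50).encode("utf-8")
    index = build_index(buf)
    assert codepoint_length(buf) == 300
    assert codepoint_to_byte(buf, index, 0) == 0
    assert codepoint_to_byte(buf, index, 5) == 8   # the 4-byte character

Armin's packed-bytes variant would shrink the walk in codepoint_to_byte() to at most 7 codepoints at the cost of one more array.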
_______________________________________________ pypy-dev mailing list pypy-dev@python.org https://mail.python.org/mailman/listinfo/pypy-dev