Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage

Armin Rigo Sat, 05 Mar 2016 00:22:05 -0800

Hi Piotr,

Thanks for giving some serious thought to the utf8-stored unicode
string proposal!

On 5 March 2016 at 01:48, Piotr Jurkiewicz
<[email protected]> wrote:
>     Random access would be as follows:
>
>         page_num, byte_in_page = divmod(codepoint_pos, 64)
>         page_start_byte = index[page_num]
>         exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
>         return buffer[exact_byte]

This is the part I'm least sure about: seek_forward() needs to be a
loop over 0 to 63 codepoints.  True, each iteration can be branchless
and very short, let's say 4 instructions.  But that still makes a total
of up to 252 instructions (plus the checks to know if we must go on).
These instructions are all, or almost all, dependent on the previous
one: you must have finished computing the length of one sequence
before you can even begin computing the length of the next one.  Maybe
it's faster to use a more "XMM-izable" algorithm which counts 0 for
each byte in 0x80-0xBF (the UTF-8 continuation bytes) and 1 otherwise,
and sums the results.
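
To make the difference concrete, here is a rough Python sketch of the
two ways of walking forward.  The function names are made up for
illustration and valid UTF-8 is assumed; the real thing would be
RPython or machine code, so this only shows the data dependencies, not
the performance.

```python
def seq_len(lead):
    """Length in bytes of the UTF-8 sequence starting with this lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def seek_forward_chained(buf, start, n):
    """Skip n codepoints; each step needs the length computed by the previous one."""
    pos = start
    for _ in range(n):
        pos += seq_len(buf[pos])
    return pos


def count_codepoints(buf, start, stop):
    """Count codepoints in buf[start:stop]: 0 for bytes in 0x80-0xBF, 1 otherwise."""
    # Every byte is classified independently, which is what makes this
    # formulation friendly to SIMD registers.
    return sum((b & 0xC0) != 0x80 for b in buf[start:stop])


def seek_forward_counting(buf, start, n):
    """Skip n codepoints by counting non-continuation bytes from start."""
    seen = 0
    for pos in range(start, len(buf)):
        seen += (buf[pos] & 0xC0) != 0x80   # a bool adds as 0 or 1
        if seen == n + 1:
            return pos
    return len(buf)
```

Both seek functions expect 'start' to be the first byte of a codepoint,
which is what the page index guarantees.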

There are also variants, e.g. adding a second array of words similar
to 'index', but where each word is 8 packed bytes giving 8 starting
points inside the page (each in range 0-252).  This would reduce the
walk to 0-7 codepoints.
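
A minimal sketch of that variant, again in plain Python and with
invented names (build_indexes, byte_offset, 'subindex'), assuming valid
UTF-8 and one starting point every 8 codepoints.  With that spacing the
stored offsets happen to stay within one byte; other layouts are
possible.

```python
PAGE = 64   # codepoints per page, as in the quoted proposal
STEP = 8    # codepoints between two starting points inside a page


def build_indexes(buf):
    """Return (index, subindex) for a UTF-8 byte string."""
    index = []      # byte offset of the first codepoint of each page
    subindex = []   # per page: 8 one-byte offsets packed into one integer
    packed = 0
    cp = 0          # number of codepoints seen so far
    for pos, b in enumerate(buf):
        if (b & 0xC0) == 0x80:
            continue                      # continuation byte
        if cp % PAGE == 0:
            if cp:
                subindex.append(packed)   # close the previous page
            index.append(pos)
            packed = 0
        if cp % STEP == 0:
            slot = (cp % PAGE) // STEP
            packed |= (pos - index[-1]) << (8 * slot)
        cp += 1
    if index:
        subindex.append(packed)
    return index, subindex


def byte_offset(buf, index, subindex, codepoint_pos):
    """Byte offset of an existing codepoint, walking at most 7 codepoints."""
    page_num, cp_in_page = divmod(codepoint_pos, PAGE)
    slot, walk = divmod(cp_in_page, STEP)
    pos = index[page_num] + ((subindex[page_num] >> (8 * slot)) & 0xFF)
    for _ in range(walk):
        pos += 1
        while (buf[pos] & 0xC0) == 0x80:  # skip continuation bytes
            pos += 1
    return pos
```

The cost is one extra word per page on top of 'index', in exchange for
cutting the worst-case walk from 63 codepoints down to 7.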

I'm +1 on your proposal. The whole thing is definitely worth a try.

A bientôt,

Armin.
_______________________________________________
pypy-dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/pypy-dev
