Hi Piotr,

Thanks for giving some serious thought to the utf8-stored unicode string proposal!

On 5 March 2016 at 01:48, Piotr Jurkiewicz <piotr.jerzy.jurkiew...@gmail.com> wrote:
> Random access would be as follows:
>
> page_num, byte_in_page = divmod(codepoint_pos, 64)
> page_start_byte = index[page_num]
> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
> return buffer[exact_byte]

This is the part I'm least sure about: seek_forward() needs to be a loop over 0 to 63 codepoints. True, each iteration can be branchless and very short, let's say 4 instructions. But that still makes a total of up to 252 instructions (63 * 4), plus the checks to know whether we must go on. These instructions are all, or almost all, dependent on the previous one: you must have finished computing the length of one sequence before you can even begin computing the length of the next one.

Maybe it's faster to use a more "XMM-izable" algorithm which counts 0 for each byte in 0x80-0xBF and 1 otherwise, and sums the result.

There are also variants, e.g. adding a second array of words similar to 'index', but where each word is 8 packed bytes giving 8 starting points inside the page (each a byte offset in the range 0-252). This would reduce the walk to 0-7 codepoints.

I'm +1 on your proposal. The whole thing is definitely worth a try.

A bientôt,

Armin.
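
For illustration only, here is a minimal Python sketch of the page-indexed lookup discussed above. It is not code from PyPy or from the proposal: build_index(), count_codepoints() and codepoint_at() are hypothetical names, and this seek_forward() takes (buf, start, n), which is an assumed signature that differs from the quoted pseudocode. The counting function shows the "0 for bytes in 0x80-0xBF, 1 otherwise" idea in scalar form; a real implementation would presumably vectorize it.

```python
PAGE = 64  # codepoints per page, as in the quoted proposal


def count_codepoints(buf, start, stop):
    # Count 0 for every continuation byte (0x80-0xBF) and 1 otherwise.
    # Every term is independent of the others, which is what makes
    # this form a candidate for SIMD.
    return sum(1 for b in buf[start:stop] if (b & 0xC0) != 0x80)


def seek_forward(buf, start, n):
    # Walk n codepoints forward from `start`, which must be the first
    # byte of a codepoint. Each step depends on the previous one.
    pos = start
    while n > 0:
        pos += 1
        while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
            pos += 1
        n -= 1
    return pos


def build_index(buf):
    # index[k] is the byte offset of codepoint number k * PAGE.
    index = []
    seen = 0
    for pos, b in enumerate(buf):
        if (b & 0xC0) != 0x80:
            if seen % PAGE == 0:
                index.append(pos)
            seen += 1
    return index


def codepoint_at(buf, index, codepoint_pos):
    # Jump to the page start, then walk at most PAGE - 1 codepoints.
    page_num, cp_in_page = divmod(codepoint_pos, PAGE)
    start = seek_forward(buf, index[page_num], cp_in_page)
    end = seek_forward(buf, start, 1)
    return buf[start:end].decode("utf-8")
```

Usage, assuming the buffer holds valid UTF-8:

    buf = "héllo wörld".encode("utf-8")
    index = build_index(buf)
    codepoint_at(buf, index, 7)

The second-level array of packed starting points would replace the walk inside codepoint_at() by a jump to the nearest of 8 sub-page offsets followed by a walk of at most 7 codepoints.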