Re: [Python-Dev] PEP 393 Summer of Code Project

Nick Coghlan Wed, 24 Aug 2011 06:08:22 -0700

On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy <[email protected]> wrote:
> In utf16.py, attached to http://bugs.python.org/issue12729
> I propose for consideration a prototype of different solution to the 'mostly
> BMP chars, few non-BMP chars' case. Rather than expand every character from
> 2 bytes to 4, attach an array cpdex of character (ie code point, not code
> unit) indexes. Then for indexing and slicing, the correction is simple,
> simpler than I first expected:
>  code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
> where code-unit-index is the adjusted index into the full underlying
> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
> most of the space penalty and the consequent time penalty of moving more
> bytes around and increasing cache misses.


Interesting idea, but putting on my C programmer hat, I say -1.

Non-uniform cell size = not a C array = standard C array manipulation
idioms don't work = pain (no matter how simple the index correction
happens to be).

The nice thing about PEP 383 is that it gives us the smallest storage
array that is both an ordinary C array and has sufficiently large
individual elements to handle every character in the string.

Cheers,
Nick.

-- 
Nick Coghlan   |   [email protected]   |   Brisbane, Australia
_______________________________________________
Python-Dev mailing list
[email protected]
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] PEP 393 Summer of Code Project

Reply via email to