On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy <tjre...@udel.edu> wrote:
> In utf16.py, attached to http://bugs.python.org/issue12729
> I propose for consideration a prototype of a different solution to the 'mostly
> BMP chars, few non-BMP chars' case. Rather than expand every character from
> 2 bytes to 4, attach an array cpdex of character (ie code point, not code
> unit) indexes. Then for indexing and slicing, the correction is simple,
> simpler than I first expected:
>  code_unit_index = char_index + bisect.bisect_left(cpdex, char_index)
> where code_unit_index is the adjusted index into the full underlying
> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
> most of the space penalty and the consequent time penalty of moving more
> bytes around and increasing cache misses.
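
For readers who have not opened utf16.py, the index correction quoted above
can be sketched roughly as follows. This is not the code attached to the
issue; the helper names build_cpdex, code_unit_index and char_at are made up
for illustration, and the sketch assumes the input contains no lone
surrogates.

```python
import bisect


def build_cpdex(text):
    """Encode text as UTF-16 code units plus an index of non-BMP characters.

    Returns (units, cpdex): units is a list of 16-bit code units, cpdex is the
    sorted list of character indexes whose code point lies outside the BMP.
    Hypothetical helper, assumes no lone surrogates in text.
    """
    units = []
    cpdex = []
    for char_index, ch in enumerate(text):
        cp = ord(ch)
        if cp > 0xFFFF:
            # Non-BMP: remember the character index, store a surrogate pair.
            cpdex.append(char_index)
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
        else:
            units.append(cp)
    return units, cpdex


def code_unit_index(cpdex, char_index):
    # Each non-BMP character before char_index occupies one extra code unit;
    # bisect_left counts the cpdex entries strictly less than char_index.
    return char_index + bisect.bisect_left(cpdex, char_index)


def char_at(units, cpdex, char_index):
    """Return the character at char_index, decoding a surrogate pair if needed."""
    i = code_unit_index(cpdex, char_index)
    unit = units[i]
    if 0xD800 <= unit <= 0xDBFF:
        low = units[i + 1]
        return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
    return chr(unit)
```

Note that even in this small sketch char_at has to branch on the cell
contents, which is the kind of special-casing the reply below objects to.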

Interesting idea, but putting on my C programmer hat, I say -1.

Non-uniform cell size = not a C array = standard C array manipulation
idioms don't work = pain (no matter how simple the index correction
happens to be).

The nice thing about PEP 393 is that it gives us the smallest storage
array that is both an ordinary C array and has sufficiently large
individual elements to handle every character in the string.
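
To make the "smallest ordinary array" point concrete, here is a minimal
Python sketch of the idea, not CPython's actual C implementation. It assumes
cell widths of 1, 2 or 4 bytes chosen from the largest code point in the
string; the function names are hypothetical.

```python
def smallest_cell_size(text):
    """Return the smallest uniform cell width (in bytes) that fits every character."""
    max_cp = max(map(ord, text), default=0)
    if max_cp < 0x100:
        return 1
    if max_cp < 0x10000:
        return 2
    return 4


def pack_uniform(text):
    """Pack text into a flat bytes buffer with one fixed-width cell per character."""
    size = smallest_cell_size(text)
    data = b"".join(ord(ch).to_bytes(size, "little") for ch in text)
    return size, data


def char_at_uniform(size, data, char_index):
    # Uniform cells: the offset is a plain multiplication, with no lookup table
    # and no branching on the cell contents.
    start = char_index * size
    return chr(int.from_bytes(data[start:start + size], "little"))
```

A single non-BMP character widens every cell to 4 bytes, which is the space
cost the quoted proposal tries to avoid; in exchange, indexing stays plain
array arithmetic.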

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev