Terry J. Reedy <[email protected]> added the comment:
My proposal is better than log(N) in 2 respects.
1) A time penalty is needed only when the string contains non-BMP chars, which
is exactly when indexing currently gives the 'wrong' answer and a time penalty
should therefore be acceptable. Lookup for normal all-BMP strings could remain
the same.
2) The penalty is log(K), where K is the number of non-BMP chars. In theory,
O(logK) is as 'bad' as O(logN), for any fixed ratio K/N. In practice, the
difference should be noticeable when there are just a few (say .01%)
extended-range chars.
I am aware that this is an idea for the future, not now.
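To make the log(K) idea concrete, here is a rough sketch, not a proposed patch. It simulates narrow-build storage as a list of UTF-16 code units and keeps a sorted side list of the code point indexes of the non-BMP chars. The class name IndexedUTF16 and the helper utf16_units are made up for illustration, and the input is assumed to be well-formed (no lone surrogates).

```python
from bisect import bisect_left


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints.

    This stands in for narrow-build storage; s must not contain lone
    surrogates, or the encode call raises UnicodeEncodeError.
    """
    data = s.encode('utf-16-le')
    return [int.from_bytes(data[i:i + 2], 'little')
            for i in range(0, len(data), 2)]


class IndexedUTF16:
    """Hypothetical string wrapper indexed by code point, not code unit."""

    def __init__(self, s):
        self.units = utf16_units(s)
        # Sorted code point indexes of the K non-BMP characters.
        self.astral = [i for i, ch in enumerate(s) if ord(ch) > 0xFFFF]

    def __len__(self):
        # Each non-BMP character occupies two code units.
        return len(self.units) - len(self.astral)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError('index out of range')
        if not self.astral:
            # All-BMP string: plain O(1) lookup, no penalty.
            return chr(self.units[i])
        # Number of non-BMP chars strictly before code point i: O(log K).
        k = bisect_left(self.astral, i)
        u = i + k  # each earlier non-BMP char shifts the offset by one unit
        unit = self.units[u]
        if 0xD800 <= unit <= 0xDBFF:
            low = self.units[u + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)
```

With this layout an all-BMP string has an empty side list and takes the fast path, so only strings that actually contain extended-range chars pay for the binary search.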
---
Fixing string iteration on narrow builds to produce code points the same
as with wide builds is easy and costs O(1) per code point (character), which is
the same as the current cost. Then
>>> from unicodedata import name
>>> name('\U0001043c')
'DESERET SMALL LETTER DEE'
>>> for c in 'a\U0001043c': name(c)
'LATIN SMALL LETTER A'
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    for c in 'a\U0001043c': name(c)
ValueError: no such name
would work like it does on wide builds instead of failing.
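For illustration only, iteration by code point over UTF-16 code units could look something like the sketch below. The function name iter_code_points is made up, and the code units are passed in as a plain list of ints to stand in for narrow-build storage. Unpaired surrogates are yielded unchanged, which is an assumption about the desired behavior rather than anything decided in this issue.

```python
from unicodedata import name


def iter_code_points(units):
    """Yield one str per code point from a sequence of UTF-16 code units.

    A high surrogate followed by a low surrogate is combined into a single
    non-BMP character; anything else is yielded as is. Each step does a
    constant amount of work, so the cost stays O(1) per code point.
    """
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            low = units[i + 1]
            yield chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            yield chr(unit)
            i += 1


# 'a\U0001043c' as it would be stored on a narrow build: three code units.
narrow = [0x0061, 0xD801, 0xDC3C]
names = [name(c) for c in iter_code_points(narrow)]
```

Here names holds one entry per character, and the second entry is the 'DESERET SMALL LETTER DEE' from the transcript above.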
I admit that it would be strange to have default iteration produce different
items than default indexing (and indeed, str currently iterates by sequential
indexing). But keeping them in sync means that buggy iteration is another cost
of O(1) indexing.
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12729>
_______________________________________