[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Terry J. Reedy Fri, 26 Aug 2011 20:12:40 -0700

Terry J. Reedy <[email protected]> added the comment:

My proposal improves on O(log N) indexing in two respects.


1) The time penalty applies only to strings that contain non-BMP chars, where 
indexing currently gives the 'wrong' answer and a time penalty should therefore 
be acceptable. Lookup for normal all-BMP strings could remain the same.

2) The penalty is O(log K), where K is the number of non-BMP chars. In theory, 
O(log K) is as 'bad' as O(log N) for any fixed ratio K/N. In practice, the 
difference should be noticeable when there are just a few (say .01%) 
extended-range chars.
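
One way to picture the O(log K) lookup is to keep the narrow-build UTF-16 code
units and add a sorted list of the code point indexes at which non-BMP chars
occur. The following is only a sketch under that assumption, not a proposed
patch. The class name IndexedUTF16 and its attributes units and astral are made
up for illustration, and it runs on a current Python 3 only to simulate
narrow-build storage.

```python
from bisect import bisect_left


class IndexedUTF16:
    """Hypothetical sketch: UTF-16 code units plus an index of non-BMP chars."""

    def __init__(self, text):
        self.units = []   # UTF-16 code units, as stored on a narrow build
        self.astral = []  # code point indexes of non-BMP chars, ascending
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp > 0xFFFF:
                # Split into a surrogate pair and remember where it lives.
                cp -= 0x10000
                self.units.append(0xD800 + (cp >> 10))
                self.units.append(0xDC00 + (cp & 0x3FF))
                self.astral.append(i)
            else:
                self.units.append(cp)

    def __len__(self):
        # Each non-BMP char takes two code units but counts once.
        return len(self.units) - len(self.astral)

    def __getitem__(self, i):
        # Negative indexes and slices are left out to keep the sketch short.
        if not 0 <= i < len(self):
            raise IndexError(i)
        if not self.astral:
            return chr(self.units[i])      # all-BMP string: unchanged O(1) path
        k = bisect_left(self.astral, i)    # non-BMP chars before i: O(log K)
        pos = i + k                        # each of them shifts i by one unit
        unit = self.units[pos]
        if k < len(self.astral) and self.astral[k] == i:
            low = self.units[pos + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)


s = IndexedUTF16('a\U0001043cb')
print(len(s), [hex(ord(s[i])) for i in range(len(s))])
```

When astral is empty the lookup is the plain O(1) one. The bisect is paid only
by strings that actually contain non-BMP chars.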

I am aware that this is an idea for the future, not now.
---

Fixing string iteration on narrow builds to produce the same code points as 
on wide builds is easy and costs O(1) per code point (character), which is 
the same as the current cost. Then this narrow-build session

>>> from unicodedata import name
>>> name('\U0001043c')
'DESERET SMALL LETTER DEE'
>>> for c in 'a\U0001043c': name(c)
'LATIN SMALL LETTER A'
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    for c in 'a\U0001043c': name(c)
ValueError: no such name

would work like it does on wide builds instead of failing.
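
As a rough illustration of such iteration, here is a sketch that walks a
sequence of UTF-16 code units and joins each surrogate pair into one code
point. The helper names utf16_units and iter_code_points are made up for this
example. It simulates narrow-build storage on a current Python 3 and is not the
actual str iterator.

```python
import struct
from unicodedata import name


def utf16_units(text):
    """Return the UTF-16 code units a narrow build would store for text."""
    data = text.encode('utf-16-le', 'surrogatepass')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))


def iter_code_points(units):
    """Yield one character per code point, pairing surrogates; O(1) each."""
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: one non-BMP char.
            low = units[i + 1]
            yield chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP char, or a lone surrogate passed through unchanged.
            yield chr(unit)
            i += 1


for c in iter_code_points(utf16_units('a\U0001043c')):
    print(name(c))
```

Each step looks at no more than two code units, so the cost per code point
stays constant.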

I admit that it would be strange to have default iteration produce different 
items than default indexing (and indeed, str currently iterates by sequential 
indexing). But keeping them in sync means that buggy iteration is another cost 
of O(1) indexing.

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12729>
_______________________________________