New submission from Steven D'Aprano <[email protected]>:
I think there is an opportunity to speed up some unicode normalisations
significantly.
In 3.9 at least, the time taken by normalisation appears to grow with the
length of the string, even when the input is pure ASCII:
>>> setup="from unicodedata import normalize; s = 'reverse'"
>>> t1 = Timer('normalize("NFKC", s)', setup=setup)
>>> setup="from unicodedata import normalize; s = 'reverse'*1000"
>>> t2 = Timer('normalize("NFKC", s)', setup=setup)
>>>
>>> min(t1.repeat(repeat=7))
0.04854234401136637
>>> min(t2.repeat(repeat=7))
9.98313440399943
But ASCII strings are always in normalised form, for all four normalisation
forms. In CPython, with PEP 393 (Flexible String Representation), it should be
a constant-time operation to detect whether a string is pure ASCII, and avoid
scanning the string or attempting the normalisation.
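As a minimal pure-Python sketch of the idea (the real change would
presumably live in the C implementation of unicodedata.normalize();
normalize_fast is just a hypothetical name for illustration):

    from unicodedata import normalize as _normalize

    def normalize_fast(form, s):
        # str.isascii() is O(1) in CPython: PEP 393 string objects
        # record an ASCII flag in the object header, so no scan of
        # the characters is needed.
        if s.isascii():
            # ASCII text is already in NFC, NFD, NFKC and NFKD form,
            # so it can be returned unchanged.
            return s
        return _normalize(form, s)

With a check like this, normalising the 7000-character ASCII string in the
timings above should cost about the same as normalising the 7-character one.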
----------
components: Unicode
messages: 400192
nosy: ezio.melotti, steven.daprano, vstinner
priority: normal
severity: normal
status: open
title: Speed up unicode normalization of ASCII strings
type: enhancement
versions: Python 3.11
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue44987>
_______________________________________