New submission from Steven D'Aprano <[email protected]>:
I think there is an opportunity to speed up some unicode normalisations
significantly.
In 3.9 at least, the time taken by normalisation appears to grow with the
length of the string, even when the input is pure ASCII:
>>> setup="from unicodedata import normalize; s = 'reverse'"
>>> t1 = Timer('normalize("NFKC", s)', setup=setup)
>>> setup="from unicodedata import normalize; s = 'reverse'*1000"
>>> t2 = Timer('normalize("NFKC", s)', setup=setup)
>>>
>>> min(t1.repeat(repeat=7))
0.04854234401136637
>>> min(t2.repeat(repeat=7))
9.98313440399943
But ASCII strings are always in normalised form, for all four normalisation
forms. In CPython, with PEP 393 (Flexible String Representation), it should be
a constant-time operation to detect whether a string is pure ASCII, and avoid
scanning the string or attempting the normalisation.
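As a minimal pure-Python sketch of the idea (the real change would
presumably live in the C implementation of unicodedata.normalize();
normalize_fast is just a hypothetical name for illustration):

    from unicodedata import normalize as _normalize

    def normalize_fast(form, s):
        # str.isascii() is O(1) in CPython: PEP 393 string objects
        # record an ASCII flag in the object header, so no scan of
        # the characters is needed.
        if s.isascii():
            # ASCII text is already in NFC, NFD, NFKC and NFKD form,
            # so it can be returned unchanged.
            return s
        return _normalize(form, s)

With a check like this, normalising the 7000-character ASCII string in the
timings above should cost about the same as normalising the 7-character one.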
----------
components: Unicode
messages: 400192
nosy: ezio.melotti, steven.daprano, vstinner
priority: normal
severity: normal
status: open
title: Speed up unicode normalization of ASCII strings
type: enhancement
versions: Python 3.11
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue44987>
_______________________________________