[issue37966] is_normalized is much slower than the standard's algorithm

Greg Price Tue, 27 Aug 2019 20:54:10 -0700


New submission from Greg Price <[email protected]>:


In 3.8 we add a new function `unicodedata.is_normalized`.  The result is 
equivalent to `str == unicodedata.normalize(form, str)`, but the implementation 
uses a version of the "quick check" algorithm from UAX #15 as an optimization 
to try to avoid having to copy the whole string.  This was added in issue 
#32285, commit 2810dd7be.

However, it turns out the code doesn't actually implement the same algorithm as 
UAX #15, and as a result we often miss the optimization and end up having to 
compute the whole normalized string after all.

Here's a quick demo on my desktop.  We pass a long string made entirely out of 
a character for which the quick-check algorithm always says `NO`, it's not 
normalized:

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
    'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
    's == unicodedata.normalize("NFD", s)'
50 loops, best of 5: 4.41 msec per loop

That's the same 4.4 ms (for a 1 MB string) with or without the attempted 
optimization.

Here it is after a patch that makes the algorithm run as in the standard:

$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' -- \
    'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop

Nearly 5 orders of magnitude faster -- the difference between O(N) and O(1).

The root cause of the issue is that our `is_normalized` static helper, which 
the new function relies on, was never written as a full implementation of the 
quick-check algorithm.  The full algorithm can return YES, MAYBE, or NO; but 
originally this helper's only caller was the implementation of 
`unicodedata.normalize`, which only cares about YES vs. MAYBE-or-NO.  So the 
helper often returns MAYBE when the standard algorithm would say NO.

(More precisely, perhaps: it's fine that this helper was never a full 
implementation... but it didn't say that anywhere, even while referring to the 
standard algorithm, and as a result set us up for future confusion.)

That's exactly what's happening in the example above: the standard quick-check 
algorithm would say NO, but our helper says MAYBE.  Which for 
`unicodedata.is_normalized` means it has to go compute the whole normalized 
string.

----------
messages: 350651
nosy: Greg Price
priority: normal
severity: normal
status: open
title: is_normalized is much slower than the standard's algorithm
versions: Python 3.8

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue37966>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue37966] is_normalized is much slower than the standard's algorithm

Reply via email to