Steve Dower <[email protected]> added the comment:
The benchmark may not be triggering that much work: NFKC normalization is
only applied to identifiers containing characters outside the basic Latin
(ASCII) range (0-127).
I ran the benchmarks below and saw a huge difference. Granted, it's a very
degenerate case with collections this big, but the cost appears to be linear
in len(NAMES), suggesting that the normalization is the expensive part.
>>> import random, timeit, unicodedata
>>> CHRS = [c for c in (chr(i) for i in range(65535)) if c.isidentifier()]
>>> def makename():
... return ''.join(random.choice(CHRS) for _ in range(10))
...
>>> NAMES = [makename() for _ in range(10000)]
>>> timeit.timeit('len(set(NAMES))', globals=globals(), number=100000)
38.04007526000004
>>> timeit.timeit('len(set(unicodedata.normalize("NFKC", n) for n in NAMES))',
...               globals=globals(), number=100000)
820.2586788580002
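For comparison, `unicodedata.is_normalized()` (available since Python 3.8)
can report whether a string is already in NFKC form without building a
normalized copy, and it short-circuits for ASCII-only names. A small sketch
of that check (the example identifiers are made up):

```python
import unicodedata

# ASCII-only identifiers are already in NFKC form, so this returns True
# without allocating a normalized copy of the string.
print(unicodedata.is_normalized("NFKC", "plain_ascii_name"))  # True

# The ligature "fi" (U+FB01) decomposes to "fi" under NFKC, so a name
# containing it is not yet normalized.
print(unicodedata.is_normalized("NFKC", "\ufb01le"))           # False
print(unicodedata.normalize("NFKC", "\ufb01le"))               # 'file'
```

A quick `is_normalized` pre-check like this would let the common all-ASCII
case skip the expensive `normalize` call entirely.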
I wonder if it's better to catch the SyntaxError and do the check there? That
way we don't really have a performance impact, since it's only going to show up
in exceptional cases anyway.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue33881>
_______________________________________