Steve Dower <[email protected]> added the comment:
The benchmark may not be triggering that much work: NFKC normalization is
only applied to identifiers containing characters outside the basic Latin
(ASCII) range (0-127).
I ran the benchmarks below and saw a huge difference. Granted, it's a very
degenerate case with collections this big, but the cost appears to be linear
in len(NAMES), suggesting that the normalization is the expensive part.
>>> import random, timeit, unicodedata
>>> CHRS = [c for c in (chr(i) for i in range(65535)) if c.isidentifier()]
>>> def makename():
... return ''.join(random.choice(CHRS) for _ in range(10))
...
>>> NAMES = [makename() for _ in range(10000)]
>>> timeit.timeit('len(set(NAMES))', globals=globals(), number=100000)
38.04007526000004
>>> timeit.timeit('len(set(unicodedata.normalize("NFKC", n) for n in NAMES))',
...               globals=globals(), number=100000)
820.2586788580002
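For comparison, `unicodedata.is_normalized()` (available since Python 3.8)
can report whether a string is already in NFKC form without building a
normalized copy, and it short-circuits for ASCII-only names. A small sketch
of that check (the example identifiers are made up):

```python
import unicodedata

# ASCII-only identifiers are already in NFKC form, so this returns True
# without allocating a normalized copy of the string.
print(unicodedata.is_normalized("NFKC", "plain_ascii_name"))  # True

# The ligature "fi" (U+FB01) decomposes to "fi" under NFKC, so a name
# containing it is not yet normalized.
print(unicodedata.is_normalized("NFKC", "\ufb01le"))           # False
print(unicodedata.normalize("NFKC", "\ufb01le"))               # 'file'
```

A quick `is_normalized` pre-check like this would let the common all-ASCII
case skip the expensive `normalize` call entirely.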
I wonder if it's better to catch the SyntaxError and do the check there? That
way we don't really have a performance impact, since it's only going to show up
in exceptional cases anyway.
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue33881>
_______________________________________