[Python-ideas] Re: Add a .whitespace property to module unicodedata

Chris Angelico Fri, 02 Jun 2023 17:23:08 -0700

On Sat, 3 Jun 2023 at 10:12, David Mertz, Ph.D. <david.me...@gmail.com> wrote:
>
> Let's call the styles a tie.  Using the SOWPODS scrabble wordlist (no
> currency symbols, so False answer):
>
> >>> unicode_currency = {chr(c) for c in range(0xFFFF) if
> ...     unicodedata.category(chr(c)) == "Sc"}
> >>> wordlist = open('/usr/local/share/sowpods').read()
> >>> len(wordlist)
> 2707021
> >>> %timeit any(unicodedata.category(ch) == "Sc" for ch in wordlist)
> 176 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>> %timeit any(unicodedata.category(ch) == "Sc" for ch in set(wordlist))
> 17.8 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> >>> bool(set(wordlist) & unicode_currency)
> False
> >>> %timeit bool(set(wordlist) & unicode_currency)
> 18 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> Of course, this is a small character set of 26 lowercase letters (and
> newline as I did it).  A more diverse alphabet might tip the timing
> slightly, but it's going to be a small matter either way.
>


Remember though, the original request was not for a set, but for a
string. Try your timing again when working with a string.
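A rough sketch of how that comparison might be rerun with a string
instead of a set. The names currency_str and currency_set and the
stand-in text are made up for illustration (this is not the SOWPODS
file), and the sketch scans the full code point range rather than
range(0xFFFF). No timings are claimed here; they will vary by machine.

```python
import sys
import timeit
import unicodedata

# Hypothetical names: the same currency symbols held as a str and as a set.
currency_str = "".join(
    chr(c)
    for c in range(sys.maxunicode + 1)
    if unicodedata.category(chr(c)) == "Sc"
)
currency_set = set(currency_str)

# Stand-in text with no currency symbols, so every character gets checked.
text = "aardvark\nzebra\n" * 1000


def has_currency_str(s):
    # Membership test against a str is a linear scan of currency_str.
    return any(ch in currency_str for ch in s)


def has_currency_set(s):
    # Membership test against a set is a hash lookup.
    return any(ch in currency_set for ch in s)


for fn in (has_currency_str, has_currency_set):
    seconds = timeit.timeit(lambda: fn(text), number=10)
    print(fn.__name__, seconds)
```

With only a few dozen currency symbols, the linear scan of the string
is short, so the gap between the two may be modest.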

The any() form is almost certainly the most effective, although I
suppose it could be implemented in C for better performance (avoiding
calling back into Python repeatedly). Not sure it's necessary though.
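For reference, a minimal sketch of the any() form wrapped in a helper.
The function name contains_currency is hypothetical and not part of
unicodedata or any proposal in this thread.

```python
import unicodedata


def contains_currency(text):
    """Return True if any character in text has Unicode category Sc."""
    # any() stops at the first match, so text with an early currency
    # symbol is not scanned to the end.
    return any(unicodedata.category(ch) == "Sc" for ch in text)


print(contains_currency("cost: 5 dollars"))  # False
print(contains_currency("cost: $5"))         # True
```

It needs no precomputed table and works directly on a string, at the
cost of one unicodedata.category() call per character examined.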

ChrisA
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/TOAR5FT3MDIEZFBVT7YGR6CTZ2JKCZCQ/
Code of Conduct: http://python.org/psf/codeofconduct/
