New submission from Serhiy Storchaka:
Single-character strings in the Latin1 range (U+0000 - U+00FF) are shared in
CPython. This saves memory and CPU time of per-character processing of strings
containing ASCII characters and characters from Latin based alphabets. But the
users of languages that use non-Latin based alphabets are not so lucky.
Proposed PR adds a cache for characters in BMP (U+0100 - U+FFFF) which covers
most alphabetic scripts.
Most alphabets contain characters from a single 128- or 256-character block,
therefore only lowest bits are used for addressing in the cache. But together
with the characters from a particular alphabet it is common to use ASCII
characters (spaces, newline, punctuations, digits, etc) and few Unicode
punctuation (long dash, Unicode quotes, etc). Their low bits can match low
bits of letters. Therefore every index addresses not a single character, but a
mini-LRU-cache of size 2. This keeps letters in a cache even if non-letters
with conflicting low bits are occurred.
Microbenchmarking results.
Iterating sample non-Latin-based alphabetic text (Iliad by Homer [1]) is over
70% faster:
$ ./python -m timeit -s 's = open("36248-0.txt").read()' -- 'for c in s: pass'
Unpatched: 20 loops, best of 5: 14.5 msec per loop
Patched: 50 loops, best of 5: 8.32 msec per loop
Iterating sample hieroglyphic text (Shui Hu Zhuan by Nai an Shi [2]) is about
4% slower:
$ ./python -m timeit -s 's = open("23863-0.txt").read()' -- 'for c in s: pass'
Unpatched: 20 loops, best of 5: 11.7 msec per loop
Patched: 20 loops, best of 5: 12.1 msec per loop
Iterating a string containing non-repeated characters from the all BMP range is
20% slower:
$ ./python -m timeit -s 's = "".join(map(chr, range(0x10000)))' -- 'for c in s:
pass'
Unpatched: 200 loops, best of 5: 1.39 msec per loop
Patched: 200 loops, best of 5: 1.7 msec per loop
[1] https://www.gutenberg.org/files/36248/36248-0.txt
[2] https://www.gutenberg.org/files/23863/23863-0.txt
----------
components: Interpreter Core, Unicode
messages: 302253
nosy: benjamin.peterson, ezio.melotti, haypo, lemburg, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: Cache single-character strings outside of the Latin1 range
type: performance
versions: Python 3.7
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue31484>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com