Ezio Melotti added the comment:
Given that high surrogates are U+D800..U+DBFF, and low ones are U+DC00..U+DFFF,
'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)' means "a low
surrogate, preceded by either a character that is not a high surrogate or the
start of the string, and followed by either a character that is not a low
surrogate or the end of the string".
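A quick way to check what this pattern accepts is to run it against a few
hand-made strings. This is only a sketch: the name has_surrogates and the
sample strings are made up for illustration, and \A and \Z are written with
doubled backslashes so the string literal does not trigger an invalid-escape
warning.

```python
import re

# The current pattern. \A and \Z are escaped for the string literal;
# the \udXXX escapes are expanded by Python into the actual code points.
current_regex = '([^\ud800-\udbff]|\\A)[\udc00-\udfff]([^\udc00-\udfff]|\\Z)'
has_surrogates = re.compile(current_regex).search

# A lone low surrogate between ordinary characters is found.
print(has_surrogates('a\udc80b') is not None)

# A low surrogate directly after a high surrogate is not reported.
print(has_surrogates('\ud800\udc80') is not None)

# A string with no surrogates at all is not reported.
print(has_surrogates('abc') is not None)
```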
PEP 383 says "With this PEP, non-decodable bytes >= 128 will be represented as
lone surrogate codes U+DC80..U+DCFF".
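For reference, this mapping can be seen with the surrogateescape error handler.
The byte string below is an arbitrary example, not taken from the test suite.

```python
# Decode bytes that are not valid ASCII using the PEP 383 error handler.
raw = b'abc\xff\x80'
decoded = raw.decode('ascii', 'surrogateescape')

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
print([hex(ord(c)) for c in decoded[3:]])

# Encoding with the same handler restores the original bytes.
restored = decoded.encode('ascii', 'surrogateescape')
print(restored == raw)
```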
If I change the regex to _has_surrogates =
re.compile('[\udc80-\udcff]').search, the tests still pass but there's no
improvement on startup time (note: the previous regex was matching all the
surrogates in this range too, however I'm not sure how well this is tested).
If I change the implementation with
    _pep383_surrogates = set(map(chr, range(0xDC80, 0xDCFF+1)))
    def _has_surrogates(s):
        return any(c in _pep383_surrogates for c in s)
the tests still pass and the startup is ~15ms faster here:
$ time ./python -m issue11454_imp2
[68837 refs]
real 0m0.305s
user 0m0.288s
sys 0m0.012s
However, using this function instead of the regex is ~10x slower at runtime.
Using the shorter regex is ~7x faster, but there is no improvement in
startup time.
Assuming the shorter regex is correct, it can still be called inside a function
or used with functools.partial, so that the pattern is compiled on first use
rather than at import. This will result in an improved startup time
and a ~2x improvement on runtime (so it's a win-win).
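A minimal sketch of the two deferred variants, assuming the shorter regex is the
right one. The names short_regex, has_surrogates_func and has_surrogates_partial
are illustrative only and are not the names used in the patch.

```python
import functools
import re

# Assumption: only the PEP 383 range U+DC80..U+DCFF needs to be detected.
short_regex = '[\udc80-\udcff]'

def has_surrogates_func(s):
    # re.search compiles the pattern on the first call and then reuses
    # the cached compiled pattern, so nothing is compiled at import time.
    return re.search(short_regex, s)

# Same idea without a wrapper function: bind the pattern argument up front.
has_surrogates_partial = functools.partial(re.search, short_regex)

print(bool(has_surrogates_func('abc\udc80')))
print(bool(has_surrogates_partial('abc')))
```

Both callables return a match object or None, like the bound search method of a
compiled pattern, so callers that only test truthiness need no changes.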
See attached patch for benchmarks.
This is a sample result:
17.01 usec/pass <- re.compile(current_regex).search
2.20 usec/pass <- re.compile(short_regex).search
148.18 usec/pass <- return any(c in surrogates for c in s)
106.35 usec/pass <- for c in s: if c in surrogates: return True
8.40 usec/pass <- return re.search(short_regex, s)
8.20 usec/pass <- functools.partial(re.search, short_regex)
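Numbers of this kind can be collected with timeit along the following lines.
This is a rough sketch and not the attached script: the sample string, the
bench helper and the iteration count are assumptions, and absolute timings
will differ between machines.

```python
import functools
import re
import timeit

short_regex = '[\udc80-\udcff]'

# Hypothetical input: a long ASCII run with one surrogate at the end.
sample = 'x' * 1000 + '\udc80'

def bench(func, number=1000):
    # Return the average time per call in microseconds.
    total = timeit.timeit(lambda: func(sample), number=number)
    return total / number * 1e6

candidates = {
    're.compile(short_regex).search': re.compile(short_regex).search,
    'return re.search(short_regex, s)': lambda s: re.search(short_regex, s),
    'functools.partial(re.search, short_regex)':
        functools.partial(re.search, short_regex),
}

results = {name: bench(func) for name, func in candidates.items()}
for name, usec in results.items():
    print('%8.2f usec/pass <- %s' % (usec, name))
```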
----------
Added file: http://bugs.python.org/file27203/issue11454_surr1.py
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue11454>
_______________________________________