On Tue, Mar 16, 2021 at 2:27 AM Carl Friedrich Bolz-Tereick <cfb...@gmx.de> wrote:
> On 3/15/21 11:16 PM, Dan Stromberg wrote:
> >
> > And it's opensource, though many of the inputs are licensed.
> >
> > The code is at https://stromberg.dnsalias.org/~strombrg/music-pipeline/
> > (https://stromberg.dnsalias.org/svn/music-pipeline/trunk/)
> >
> > It appears to be more than 10x slower.
> >
> > I haven't profiled it yet. I believe it's probably the "Blocklisting
> > files..." part that's slow. That part is O(n*m), and seems to take
> > forever. It's heavy on regular expressions.
> >
> > Are regular expressions expected to be slow on Pypy3?
>
> Hi Dan,
>
> Interesting problem! single regular expressions are reasonably fast on
> PyPy, being jitted. But I don't think we looked into the problem of
> "what if you have thousands of them" before. Your reproducer is hitting
> a kind of known, hard to fix corner case of the JIT, it's actually
> producing a linear search over the existing regular expressions for
> every match call in this case, with catastrophic consequences. It's on
> my mid-term plans to work on this problem, but not next week.

Here's another SSCCE that surprised me a little. I create and del the
compiled regexes one at a time, but it's still slow:
https://stromberg.dnsalias.org/svn/regex-fodder/trunk/regex-fodder-3

> Here's a fun workaround, that improves the performance of both CPython
> (by about 2x for me) and pypy (by 10x or so): turn the many regular
> expressions into a single one:
>
>     regex_strings = [f"(?:{one_regex()})" for repno in range(2_046)]
>     regex_compiled = re.compile("|".join(regex_strings))
>
> then you replace the match calls with a single one:
>
>     for filename in filenames:
>         if regex_compiled.match(filename):
>             matches += 1
>
> I believe you can try the same approach for your full program?

I'm familiar with the technique, as well as that of creating a single, big
trie regex. For this application though, I need to check at the end if each
regex was matched exactly once, to deter typos causing things to get missed.

Thanks much for the suggestion and more!

--
Dan Stromberg
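
For illustration only, here is a rough sketch of how the exactly-once check
mentioned above might be kept while still compiling a single alternation. It
is not code from music-pipeline; the function names and the p0, p1, ... group
names are made up for the example. It assumes each filename is matched by at
most one pattern (with alternation, only the first alternative that matches is
credited, so this is not equivalent to the full O(n*m) check), and that the
individual patterns contain no named groups or numbered backreferences.
Whether it helps on PyPy is untested.

```python
import re
from collections import Counter


def count_matches(patterns, filenames):
    """Return a Counter mapping pattern index -> number of filenames matched.

    Assumption: a filename is matched by at most one pattern. If several
    patterns could match, only the first alternative gets the credit.
    """
    # Wrap each pattern in its own named group (hypothetical names p0, p1, ...)
    # so the match object can report which alternative succeeded.
    combined = re.compile(
        "|".join(f"(?P<p{i}>{pattern})" for i, pattern in enumerate(patterns))
    )
    counts = Counter()
    for filename in filenames:
        match = combined.match(filename)
        if match:
            # lastgroup is the name of the wrapping group, e.g. "p3".
            counts[int(match.lastgroup[1:])] += 1
    return counts


def not_matched_exactly_once(patterns, filenames):
    """Return (pattern, count) pairs for patterns whose match count is not 1."""
    counts = count_matches(patterns, filenames)
    return [(p, counts[i]) for i, p in enumerate(patterns) if counts[i] != 1]
```

A pattern that never matches shows up with a count of 0 (a likely typo), and
one that matches more than once shows up with its actual count.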
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev