On 3/15/21 11:16 PM, Dan Stromberg wrote:

And it's open source, though many of the inputs are licensed.

The code is at https://stromberg.dnsalias.org/~strombrg/music-pipeline/
(https://stromberg.dnsalias.org/svn/music-pipeline/trunk/)

Under PyPy3, it appears to run more than 10x slower than under CPython.

I haven't profiled it yet.  I believe it's probably the "Blocklisting
files..." part that's slow.  That part is O(n*m) (n filenames checked
against m regular expressions) and seems to take forever.  It's heavy
on regular expressions.

Are regular expressions expected to be slow on Pypy3?

Hi Dan,

Interesting problem! Single regular expressions are reasonably fast on
PyPy, since they are JIT-compiled. But I don't think we have looked into
the problem of "what if you have thousands of them" before. Your
reproducer is hitting a known, hard-to-fix corner case of the JIT: in
this case it effectively does a linear search over all the existing
regular expressions for every match call, with catastrophic
consequences. Working on this problem is in my mid-term plans, but not
for next week.
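
For reference, the slow version presumably has roughly this shape (a
sketch, using the one_regex() helper and the filenames list from your
reproducer):

    import re

    # one compiled pattern per blocklist entry
    regexes = [re.compile(one_regex()) for _ in range(2_046)]

    matches = 0
    for filename in filenames:
        # one match attempt per pattern: O(n*m) match calls in total
        for regex in regexes:
            if regex.match(filename):
                matches += 1
                break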

Here's a fun workaround that improves the performance of both CPython
(by about 2x for me) and PyPy (by 10x or so): turn the many regular
expressions into a single one:

    # wrap each pattern in a non-capturing group and join them all
    # into one big alternation
    regex_strings = [f"(?:{one_regex()})" for _ in range(2_046)]
    regex_compiled = re.compile("|".join(regex_strings))

then you replace the per-pattern match calls with a single one:

    for filename in filenames:
        # a single match call against the combined pattern
        if regex_compiled.match(filename):
            matches += 1
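
This transformation is safe here because the alternation matches exactly
when at least one of the individual patterns would have matched, and the
loop only needs the count, not which pattern fired. If you ever do need
to know which pattern matched, re tries the alternatives left to right,
so you could recover it with named groups.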

I believe the same approach should work for your full program; could you
try it?

Cheers,

Carl Friedrich