seanfabs opened a new pull request, #239: URL: https://github.com/apache/commons-email/pull/239
There are several fixes and improvements to the regexes here:

1. The regex could backtrack excessively in the `\s*[^>]*?\s+` part. The `\s*`, `[^>]*?`, and `\s+` parts can all match whitespace, so when a long run of whitespace is encountered the engine tries every way of splitting it between them before giving up. This is a fairly well documented phenomenon: https://blog.codinghorror.com/regex-performance/
2. The `<script>` tag matching did not match multiple scripts.
3. Fix the edge case where tags that merely start with `img` or `script`, e.g. `<imgx ...>`, could match.
4. Use a greedy quantifier for the src URL, `[^"']+`, which gives a slight performance boost; we always want to grab the whole URL.

# Benchmark results

VM version: JDK 17.0.10, OpenJDK 64-Bit Server VM, 17.0.10+8-b1207.12

Result for old regex `(<[Ii][Mm][Gg]\s*[^>]*?\s+[Ss][Rr][Cc]\s*=\s*["'])([^"']+?)(["'])`:

32.339 ±(99.9%) 4.394 ops/s [Average]
(min, avg, max) = (20.590, 32.339, 40.601), stdev = 5.866
CI (99.9%): [27.945, 36.733] (assumes normal distribution)

Result for new regex, candidate with non-greedy URL `(<[Ii][Mm][Gg](?=\s)[^>]*?\s[Ss][Rr][Cc]\s*=\s*["'])([^"']+?)(["'])`:

1244.602 ±(99.9%) 225.819 ops/s [Average]
(min, avg, max) = (518.410, 1244.602, 1512.089), stdev = 301.462
CI (99.9%): [1018.783, 1470.420] (assumes normal distribution)

Result for new regex `(<[Ii][Mm][Gg](?=\s)[^>]*?\s[Ss][Rr][Cc]\s*=\s*["'])([^"']+)(["'])`:

1480.002 ±(99.9%) 160.408 ops/s [Average]
(min, avg, max) = (986.366, 1480.002, 1651.095), stdev = 214.140
CI (99.9%): [1319.594, 1640.409] (assumes normal distribution)
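As a rough illustration of points 1, 3 and 4, the sketch below compiles the old and new `img` patterns quoted in the benchmark results and compares them on a few inputs. The class name `ImgSrcRegexSketch`, the helper `firstSrc`, and the sample HTML strings are made up for this example and are not part of commons-email; the timing loop is only an informal check, not a substitute for the JMH numbers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the two pattern strings come from the PR text.
class ImgSrcRegexSketch {

    // Old pattern: \s*, [^>]*? and \s+ can all consume the same whitespace.
    static final Pattern OLD_IMG = Pattern.compile(
        "(<[Ii][Mm][Gg]\\s*[^>]*?\\s+[Ss][Rr][Cc]\\s*=\\s*[\"'])([^\"']+?)([\"'])");

    // New pattern: a lookahead requires whitespace right after "img",
    // a single \s precedes "src", and the URL group is greedy.
    static final Pattern NEW_IMG = Pattern.compile(
        "(<[Ii][Mm][Gg](?=\\s)[^>]*?\\s[Ss][Rr][Cc]\\s*=\\s*[\"'])([^\"']+)([\"'])");

    // Returns the first captured src value (group 2), or null if nothing matches.
    static String firstSrc(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Both patterns extract the URL from an ordinary tag.
        System.out.println(firstSrc(OLD_IMG, "<img alt=\"x\" src=\"b.png\">"));
        System.out.println(firstSrc(NEW_IMG, "<img alt=\"x\" src=\"b.png\">"));

        // Only the old pattern accepts a tag that merely starts with "img".
        System.out.println(firstSrc(OLD_IMG, "<imgx src=\"a.png\">"));
        System.out.println(firstSrc(NEW_IMG, "<imgx src=\"a.png\">"));

        // Informal timing on a tag padded with whitespace and no src attribute.
        String padded = "<img" + " ".repeat(300) + ">";
        for (Pattern p : new Pattern[] {OLD_IMG, NEW_IMG}) {
            long start = System.nanoTime();
            boolean found = p.matcher(padded).find();
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println("found=" + found + " in " + micros + " us");
        }
    }
}
```

Because `[^"']+` cannot cross a quote character, the greedy and non-greedy URL groups capture the same text; the greedy form simply reaches the closing quote with less backtracking.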
