Hi all,
we have been working on the auto-possessifying optimization in PCRE for some
time, and I would like to give you a brief summary of the results we achived.
This optimization replaces greedy/non-greedy single character repetitions with
their appropriate posessive form. A simple example is rewriting /a+b/ to
/a++b/. This optimization has been part of PCRE for a very long time, but it
only supported simple cases before. Now it can even replace \s* to \s*+ in
/\s*(?:left|right)?hand/. Of course using \s*+ directly in the pattern would
provide the same effect, but possessive quantifiers are among the less known
regular expression features, and they are rarely used (this is my impression at
least). The performance of possessive quantifiers are usually much higher than
other quantifiers, since the backtracking phase can be totally skipped.
However, before I show some results, let me tell you a bit more about
possessive quantifiers. Their primary purpose is to define unbreakable
multi-byte character sequences. The improved matching speed is just a side
effect. For example, "sch" represent a single consonant in german, splitting it
to "sc" and "h" is meaningless. Another nice example is newlines: most of the
time a newline can be \r, \n and \r\n. If we want
to find the "aa" and "bb" strings, which are separated by at least two
newlines, we can use the /aa(?>\r\n|\r|\n){2,}bb/ pattern. Without the ?>
bracket type, the pattern would happily accept aa\r\nbb as well, and that is
incorrect.
Back to the results, then. Once I got a pattern set used by an Intrusion
Detection System, and I use it for benchmarking and also getting ideas how
people use regular expressions. Sometimes I browse http://regexlib.com/ as
well. I realized that most patterns are not exactly efficient, so regex
compiler optimizations such as auto-possessifying seems very important. The
gain provided by this particular optimization is the following (INT:
interpreter, JIT: PCRE-JIT compiler, s: seconds):
was: INT: 412.16 s, JIT: 86.22 s
now: INT: 182.94 s, JIT: 45.46 s
progress: INT: 125% JIT: 90%
Of course on other pattern sets the results might be totally different, but we
hope this helps to improve the overall performance of our favourite regex
engine.
Regards,
Zoltan
--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev