Re: [pcre-dev] A performance question about PCRE library

Philip Hazel Mon, 15 Aug 2011 09:46:21 -0700

On Mon, 15 Aug 2011, 徐�z wrote:

> Case 1:
>     respectively match data source with the following 4 regular expressions.
> 
>     \btype\b\W*?\btext\b\W*?\bjavascript\b
>     \burl\b\W*?\bshell:
>     <input\b.*?\btype\b\W*?\bimage\b
>     \bonkeyup\b\W*?\=
> 
> Case 2:
> 
>     merge the 4 rules above to one rule as followed with '|' and match
> the same data source as case one.
> 
>     
> \btype\b\W*?\btext\b\W*?\bjavascript\b|\burl\b\W*?\bshell:|<input\b.*?\btype\b\W*?\bimage\b|\bonkeyup\b\W*?\=
> 
> The following is my test results.
> 
>     For case 1, cost about 2.3 seconds to do 100000 matching.
>     For case 2, cost about 24 seconds to do the same thing as case one.
> 
> The test results really surprised me. I had thought the case 2 would
> have better performance. But on the contrary, the case 2 has much
> lower performance compared with case 1. I don't know why. Could you
> please explain the reason? Are there any problems with my test case?
> The attached are my test code and data.


I ran pcretest on your 5 patterns to check on the optimizations. This is 
what I got:

$ pcretest -i zzz
PCRE version 8.12 2011-01-15

/\btype\b\W*?\btext\b\W*?\bjavascript\b/
Capturing subpattern count = 0
No options
First char = 't'
Need char = 't'

/\burl\b\W*?\bshell:/
Capturing subpattern count = 0
No options
First char = 'u'
Need char = ':'

/<input\b.*?\btype\b\W*?\bimage\b/
Capturing subpattern count = 0
No options
First char = '<'
Need char = 'e'

/\bonkeyup\b\W*?\=/
Capturing subpattern count = 0
No options
First char = 'o'
Need char = '='

/\btype\b\W*?\btext\b\W*?\bjavascript\b|\burl\b\W*?\bshell:|<input\b.*?\
btype\b\W*?\bimage\b|\bonkeyup\b\W*?\=/
Capturing subpattern count = 0
No options
No first char
No need char

Notice that the individual patterns all have a value for "First char". 
That means, pcre_exec will skip along the subject to that character 
before starting a match. The combined pattern does not have this, so it 
will try the match at EVERY character.

I tried studying the combined pattern, assuming it would find a starting 
byte set of < o t and u (which is what studying does), but it didn't. 
This I think is a bug in pcre_study(), which I will investigate (just in 
time for the forthcoming 8.13 release). Looks like it is the initial \b
in three patterns that is foxing it, because when I remove that, it does
find the starting bytes.

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev

Re: [pcre-dev] A performance question about PCRE library

Reply via email to