[pcre-dev] [Bug 2793] Case insensitive search gets exponentially slower with larger buffers and a specific text file

admin Tue, 27 Jul 2021 08:08:09 -0700

https://bugs.exim.org/show_bug.cgi?id=2793


--- Comment #1 from Philip Hazel <philip.ha...@gmail.com> ---
After a short look, I think this is an optimization effect. Your search pattern
is "EDL". PCRE2 has an optimization that recognizes that any match must start
with "E", so, before firing up the search engine, it does a fast scan through
the subject text, looking for "E". Your 1.txt file contains NO instances of
"E". Therefore you should get "no match" very quickly. However, when you
specify caseless matching, PCRE2 will try to match EDL (caselessly) at the very
many instances of "e" that exist in the file. Therefore, it will take longer.

Your 2.txt file, by contrast, does contain instances of "E".

When I tried running pcre2grep under the Linux "time" command, the numbers were
hard to interpret, so I can't comment on that at the moment. Though of course
pcre2grep is searching line-by-line, which might make a difference.

BUT: I then tried running your own program, and indeed it takes so much longer.
However, I found out why by adding PCRE2_NO_START_OPTIMIZE to the options. When
I did this, the time for your 1.txt file increased from 8638 to 216407, but the
time for 2.txt decreased from 8122391 to 217357 (i.e. not very different to
1.txt).

There is a limit on how much text the optimization will search, intended to
stop this kind of behaviour, but I suppose that because your files are so
large, it is nevertheless doing this forward search repeatedly. I will look at
the code in due course (busy with another program at the moment) to see if
there are any stupidities that may be contributing to this poor behaviour.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev

[pcre-dev] [Bug 2793] Case insensitive search gets exponentially slower with larger buffers and a specific text file

Reply via email to