http://bugs.exim.org/show_bug.cgi?id=1554

--- Comment #4 from Vincent Lefevre <[email protected]> 2014-12-20 00:56:31 ---

(In reply to comment #3)
> Yes and no. It does *not* check when it is actually doing the matching.
> Before starting the matching engine, the subject is checked by a
> separate function, implemented as efficiently as possible.

I meant that after doing the check, there should be a way to do the
matching from the beginning of the subject string to the first invalid
byte (not included), instead of returning an encoding error. Otherwise
grep needs to make two calls to pcre_exec, which is slower. But see
below.

> > Well, I'm using GNU grep (often recursively), and it is not possible
> > to do this on a file type basis.
>
> Does GNU grep use PCRE? Oh, is it the -P option?

Yes, with the -P option.

> How does it know to use UTF?

With -P, GNU grep assumes that the UTF-8 encoding is used, possibly with
invalid sequences. Currently it calls pcre_exec with the check, and in
case of an encoding error, it calls pcre_exec again without the check,
from the beginning of the subject string to the first invalid byte (not
included), and it reiterates. But this seems to be a bad solution:

$ time grep -P zzz file.pdf
grep -P zzz file.pdf  6.64s user 0.04s system 99% cpu 6.692 total
$ time pcregrep zzz file.pdf
pcregrep zzz file.pdf  0.71s user 0.04s system 98% cpu 0.758 total

But note that grep with its own regexp engine wins:

$ time grep zzz file.pdf
grep zzz file.pdf  0.06s user 0.04s system 77% cpu 0.135 total

Similar timings occur with the pattern 'z[0-9]zzz' (which is not a
literal string). This shows that pcregrep could still be improved in
some cases (though this may not be related to UTF-8 validity, and on
some patterns PCRE is sometimes faster than the GNU grep regexp engine,
as seen below).

> > Actually the problem is more general: on files with very short lines,
> > PCRE can be very slow for about the same reason.
>
> Is this also with GNU grep?
> Is it just as slow with pcregrep? Can you provide an example?

Actually pcregrep is a bit slower than grep -P! On a file with
10,000,000 lines containing only "a":

$ time grep zzz file
grep zzz file  0.00s user 0.00s system 55% cpu 0.007 total
$ time grep -P zzz file
grep -P zzz file  0.55s user 0.01s system 96% cpu 0.578 total
$ time pcregrep zzz file
pcregrep zzz file  0.64s user 0.00s system 97% cpu 0.659 total
$ time grep '[0-9]' file
grep '[0-9]' file  3.73s user 0.01s system 99% cpu 3.749 total
$ time grep -P '[0-9]' file
grep -P '[0-9]' file  0.56s user 0.00s system 95% cpu 0.590 total
$ time pcregrep '[0-9]' file
pcregrep '[0-9]' file  0.64s user 0.00s system 97% cpu 0.663 total

> Because I concentrated on producing a fast regex library for
> non-literal regex pattern matching. An application that uses the PCRE
> library can check for a literal string and do something else, of
> course. It sounds as though GNU grep does this, but only when not
> using PCRE.

Actually, grep is faster than the PCRE library on particular patterns,
not just literal strings, as seen above.

> > Note: with --perl-regexp, GNU grep doesn't know how to parse a PCRE
> > pattern, so that it cannot do this optimization on its side.
>
> Well, that's a GNU grep problem. :-) (Sorry, couldn't resist, but
> checking a Perl pattern for being literal can't be that hard.)

It's also a pcregrep problem, and not solved there either! Anyway, even
patterns like 'z[0-9]zzz' are concerned, as seen above.
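For illustration, the "match up to the first invalid byte, then resume
after it" strategy described above can be sketched as follows. This is
a Python model of the idea, not GNU grep's actual C code (which uses
pcre_exec and the PCRE_NO_UTF8_CHECK flag); the function name and
structure are mine.

import re

def match_lines_with_invalid_utf8(pattern: str, data: bytes):
    """Search each line of raw bytes for a pattern, tolerating
    invalid UTF-8: on a decoding error, match only the valid
    prefix (up to, but not including, the first invalid byte),
    then restart just after the invalid byte, and reiterate."""
    compiled = re.compile(pattern)
    matching_lines = []
    for line in data.split(b"\n"):
        start = 0
        while start < len(line):
            try:
                text = line[start:].decode("utf-8")
                next_start = len(line)
            except UnicodeDecodeError as err:
                # err.start is the offset (within the slice) of the
                # first invalid byte; the prefix before it is valid.
                text = line[start:start + err.start].decode("utf-8")
                next_start = start + err.start + 1
            if compiled.search(text):
                matching_lines.append(line)
                break
            start = next_start
    return matching_lines

As the timings show, iterating like this over every invalid sequence is
exactly what makes grep -P slow on binary files such as file.pdf.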
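As for "checking a Perl pattern for being literal can't be that hard":
a first-order version of that check is indeed small. The sketch below
is my own illustration (neither grep's nor pcregrep's code), and it is
deliberately conservative: it only accepts patterns with no regex
metacharacters at all, so anything escaped or bracketed still goes to
the regex engine.

# Characters with special meaning in a Perl-compatible pattern.
_PCRE_META = set("\\^$.[]|()?*+{}")

def is_literal_pattern(pattern: str) -> bool:
    """Return True if the pattern contains no regex metacharacters,
    so it could be handed to a fast literal-string searcher
    (e.g. Boyer-Moore) instead of the regex engine."""
    return not any(ch in _PCRE_META for ch in pattern)

With such a check, a front end like grep -P could route 'zzz' to its
fast literal searcher and reserve PCRE for patterns like 'z[0-9]zzz'.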
