http://bugs.exim.org/show_bug.cgi?id=1554

--- Comment #3 from Philip Hazel <[email protected]> 2014-12-19 10:21:44 ---
On Thu, 18 Dec 2014, Vincent Lefevre wrote:

> > I understand your requirement, but it is very unlikely ever to be
> > implemented because checking each character every time it is loaded
> > would slow down the matching function far too much.
>
> It already does this work when checking UTF-8 validity, doesn't it?

Yes and no. It does *not* check when it is actually doing the matching.
Before starting the matching engine, the subject is checked by a separate
function, implemented as efficiently as possible. This means that the
actual matching engine can assume that it is dealing with a valid UTF-8
string. It also means that the checking can be disabled by the
PCRE_NO_UTF8_CHECK option when the caller knows that the string is valid -
in particular when matching the same subject string many times.

Checking while actually doing the matching would not only complicate (and
therefore slow down) the matching engine but would be wasteful, because
PCRE's Perl-compatible matching uses a backtracking algorithm, and so may
inspect the same character in the subject more than once.

> Well, I'm using GNU grep (often recursively), and it is not possible to
> do this on a file type basis.

Does GNU grep use PCRE? Oh, is it the -P option? How does it know to use
UTF? The pcregrep program requires you to request UTF-8 processing
explicitly.

> Actually the problem is more general: on files with very short lines,
> PCRE can be very slow for about the same reason.

Is this also with GNU grep? Is it just as slow with pcregrep? Can you
provide an example?

> > (Incidentally, if you are just looking for literal strings, there are
> > much faster algorithms than using a regular expression.)
>
> Well, that's a PCRE problem, the goal being to be able to use the same
> command, whether the pattern is a literal string or something more complex.
> With its own regexp support, GNU grep apparently does this optimization
> (as this can be seen with timings). Why not PCRE?

Because I concentrated on producing a fast regex library for non-literal
regex pattern matching. An application that uses the PCRE library can
check for a literal string and do something else, of course. It sounds as
though GNU grep does this, but only when not using PCRE.

> Note: with --perl-regexp, GNU grep doesn't know how to parse a PCRE
> pattern, so that it cannot do this optimization on its side.

Well, that's a GNU grep problem. :-) (Sorry, couldn't resist, but checking
a Perl pattern for being literal can't be that hard.)

Anyway, there's no point in my being anything other than truthful: as I
said before, it's highly unlikely that anything like this will appear in
PCRE, at least in its current form, and at least from the current
maintainers. I won't be maintaining it for ever, so the longer term is of
course unknown.

I do also still believe that the best way to search "mixed" 8-bit text is
to do it in non-UTF-8 mode, treating it as a byte string rather than a
character string, and looking for specific byte sequences.

Philip
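
For illustration only, here is a minimal sketch in C of the application-side
approach described above: decide whether a pattern is a plain literal and,
if so, search for it as a byte sequence instead of calling the regex engine.
The helper names is_literal_pattern() and line_contains_literal() are
hypothetical and are not part of PCRE or GNU grep. The metacharacter list is
deliberately conservative, and the sketch assumes that no options which
change the meaning of ordinary characters (such as caseless or extended
mode) are in use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true when the pattern contains none of the
 * characters that can be special in a Perl-compatible pattern. This is
 * conservative: a pattern such as "a}" is rejected even though it would
 * match literally. The empty pattern is also rejected so that it is left
 * to the regex engine. */
static bool is_literal_pattern(const char *pattern)
{
    return pattern[0] != '\0' &&
           strpbrk(pattern, "\\^$.[]|()?*+{}") == NULL;
}

/* Hypothetical helper: byte-wise search of a NUL-terminated line.
 * The line is treated as a byte string, so no UTF-8 validity check is
 * needed and invalid sequences in the line are simply never matched. */
static bool line_contains_literal(const char *line, const char *literal)
{
    return strstr(line, literal) != NULL;
}
```

A caller would test is_literal_pattern() once per pattern, use
line_contains_literal() for each line when it returns true, and fall back
to the regex library otherwise. Because UTF-8 is self-synchronizing, a
byte-wise match of a valid UTF-8 literal inside valid UTF-8 text always
begins on a character boundary.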
