
http://bugs.exim.org/show_bug.cgi?id=1554

--- Comment #3 from Philip Hazel <[email protected]>  2014-12-19 10:21:44 ---
On Thu, 18 Dec 2014, Vincent Lefevre wrote:

> > I understand your requirement, but it is very unlikely ever to be
> > implemented, because checking each character every time it is loaded
> > would slow down the matching function far too much.
> 
> It already does this work when checking UTF-8 validity, doesn't it? 

Yes and no. It does *not* check when it is actually doing the matching. 
Before starting the matching engine, the subject is checked by a 
separate function, implemented as efficiently as possible. This means 
that the actual matching engine can assume that it is dealing with a valid 
UTF-8 string. It also means that the checking can be disabled by the 
PCRE_NO_UTF8_CHECK option when the caller knows that the string is 
valid - in particular when matching the same subject string many times.
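The check-once, match-many pattern can be sketched in Python (purely illustrative; PCRE itself is C, and utf8_valid is an invented helper standing in for PCRE's internal checking function):

```python
import re

def utf8_valid(data: bytes) -> bool:
    """One up-front validity check, analogous to the separate checking
    function PCRE runs before its matching engine starts."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

subject = "caf\u00e9 au lait".encode("utf-8")
pattern = re.compile(r"caf.")

# Validate once...
assert utf8_valid(subject)
text = subject.decode("utf-8")

# ...then match the same subject many times without re-checking,
# which is the saving that PCRE_NO_UTF8_CHECK exposes to callers.
matches = [pattern.search(text).group(0) for _ in range(3)]
```

The repeated matches never pay the validation cost again, which is exactly why the option matters when one subject is matched against many patterns.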

Checking while actually doing the matching would not only complicate 
(and therefore slow down) the matching engine but would be wasteful,
because PCRE's Perl-compatible matching uses a backtracking algorithm,
and so may inspect the same character in the subject more than once.
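The repeated inspection is easy to see in a toy backtracking matcher (Python, invented for illustration; it handles only literal characters and '*' on the preceding character, and requires a full anchored match):

```python
def backtrack_match(pattern, subject, p=0, s=0, reads=None):
    """Minimal backtracking matcher, a toy model of Perl-style matching.
    'reads' counts how many times each subject position is inspected."""
    if reads is None:
        reads = [0] * len(subject)
    if p == len(pattern):
        return s == len(subject), reads
    if p + 1 < len(pattern) and pattern[p + 1] == "*":
        # Greedily consume as many repetitions as possible...
        n = s
        while n < len(subject) and subject[n] == pattern[p]:
            reads[n] += 1
            n += 1
        # ...then backtrack, giving characters up one at a time.
        while n >= s:
            ok, _ = backtrack_match(pattern, subject, p + 2, n, reads)
            if ok:
                return True, reads
            n -= 1
        return False, reads
    if s < len(subject):
        reads[s] += 1
        if subject[s] == pattern[p]:
            return backtrack_match(pattern, subject, p + 1, s + 1, reads)
    return False, reads

ok, reads = backtrack_match("a*ab", "aaab")
```

Here reads comes back as [1, 1, 2, 2]: the last two subject characters are inspected twice, once during greedy consumption and once after backtracking. Per-character UTF-8 validation would repeat that work each time.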

> Well, I'm using GNU grep (often recursively), and it is not possible to
> do this on a file type basis.

Does GNU grep use PCRE? Oh, is it the -P option? How does it know to use 
UTF? The pcregrep program requires you to request UTF-8 processing 
explicitly.

> Actually the problem is more general: on files with very short lines, PCRE can
> be very slow for about the same reason.

Is this also with GNU grep? Is it just as slow with pcregrep? Can you 
provide an example?

> > (Incidentally, if you are just
> > looking for literal strings, there are much faster algorithms than using a
> > regular expression.)
> 
> Well, that's a PCRE problem, the goal being to be able to use the same
> command, whether the pattern is a literal string or something more
> complex. With its own regexp support, GNU grep apparently does this
> optimization (as this can be seen with timings). Why not PCRE?

Because I concentrated on producing a fast regex library for non-literal 
regex pattern matching. An application that uses the PCRE library can
check for a literal string and do something else, of course. It sounds
as though GNU grep does this, but only when not using PCRE.
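Such a fast path might look like this in an application (Python sketch; is_literal and grep_line are invented names, and real grep uses much more sophisticated string-search algorithms than a plain substring scan):

```python
import re

# Characters that make a pattern non-literal in most regex syntaxes.
METACHARS = set(r".^$*+?{}[]\|()")

def is_literal(pattern: str) -> bool:
    """Hypothetical pre-check an application could run before handing
    a pattern to a regex engine: no metacharacters means plain string."""
    return not any(c in METACHARS for c in pattern)

def grep_line(pattern: str, line: str) -> bool:
    if is_literal(pattern):
        # Fast path: plain substring search, no regex engine at all.
        return pattern in line
    # Slow path: fall back to real regex matching.
    return re.search(pattern, line) is not None
```

The point is that this dispatch lives in the application, not in the regex library, which matches the division of labour described above.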

> Note: with --perl-regexp, GNU grep doesn't know how to parse a PCRE
> pattern, so that it cannot do this optimization on its side.

Well, that's a GNU grep problem. :-) (Sorry, couldn't resist, but 
checking a Perl pattern for being literal can't be that hard.)

Anyway, there's no point in my being anything other than truthful: as I 
said before, it's highly unlikely that anything like this will appear in 
PCRE, at least in its current form, and at least from the current
maintainers. I won't be maintaining it for ever, so the longer term is
of course unknown. I do also still believe that the best way to search
"mixed" 8-bit text is to do it in non-UTF-8 mode, treating it as a byte
string rather than a character string, and looking for specific byte
sequences.
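In Python terms the byte-string approach looks like this (illustrative only; the same idea carries over to PCRE's 8-bit, non-UTF mode):

```python
# Encode the needle once, then search the raw bytes without ever
# decoding the haystack -- invalid UTF-8 elsewhere in the data is
# simply irrelevant.
needle = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'
haystack = b"latin1 junk: \xe9\xe9 then caf\xc3\xa9 here"

pos = haystack.find(needle)
```

A byte-oriented search succeeds here even though the haystack as a whole is not valid UTF-8, which is precisely the case that trips up UTF-8 mode.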

Philip

