Re: [pcre-dev] A clarification upon PCRE_NO_UTF(8|16)_CHECK

Zoltán Herczeg Fri, 06 Apr 2012 23:43:37 -0700

Hi Giuseppe,

The first one: if PCRE_NO_UTF8_CHECK is not passed, the whole subject string is 
checked. This can be really costly if the subject is long (several megabytes). 
The check exists because malformed strings can cause PCRE to read bytes before 
the beginning or after the end of the subject string, which can lead to 
crashes. The reason for this behaviour is speed: the matcher omits some checks, 
which makes PCRE faster but means it does not work with malformed input.


So your guess is right: if the subject string does not change, 
PCRE_NO_UTF8_CHECK can be passed to subsequent calls (regardless of whether the 
pattern changes, the match fails, or anything else). Moreover, if QString 
guarantees that it always contains a valid UTF-16 string, you can pass 
PCRE_NO_UTF16_CHECK all the time. Just check that the starting offset also 
points to the beginning of a valid UTF-16 character. This is easy: check that 
the memory location does not point to the second half of a surrogate pair.
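
A minimal sketch of that starting-offset check, assuming the subject is 
already known to be valid UTF-16 and the offset is counted in 16-bit code 
units. The function name utf16_offset_is_char_start is made up for this 
illustration and is not part of PCRE or Qt:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PCRE or Qt function).
 * Returns 1 if `offset`, counted in 16-bit code units, is a safe starting
 * offset in a valid UTF-16 subject of `length` code units, 0 otherwise. */
static int utf16_offset_is_char_start(const uint16_t *subject,
                                      size_t length, size_t offset)
{
    if (offset > length)
        return 0;               /* outside the subject */
    if (offset == length)
        return 1;               /* end of subject: no character to split */

    /* The second half of a surrogate pair (low surrogate) lies in the
     * range 0xDC00..0xDFFF; starting there would split a character. */
    return (subject[offset] & 0xFC00u) != 0xDC00u;
}
```

If this returns 0, the caller can either move the offset to the next code 
unit or leave the validity check enabled for that call.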

Btw, I heard that Qt5 alpha is released and contains a PCRE based 
QRegularExpression. Really nice job!

Regards,
Zoltan

"Giuseppe D'Angelo" <[email protected]> wrote:
> From the docs it's not 100% clear that if PCRE_NO_UTF8_CHECK (or
> PCRE_NO_UTF16_CHECK) is NOT passed to pcre_exec then the *full* subject
> string undergoes the check. Is that the case, or is the check actually
> done in "chunks" or something like that?
>
> I'm thinking about emulating //g: if PCRE checks for the validity of
> the *whole* subject string, then subsequent calls to pcre_exec may
> safely omit this check by passing PCRE_NO_UTF8_CHECK.
>
> Thanks, and have a happy week-end.
> -- 
> Giuseppe D'Angelo
>
> -- 
> ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev

