dams Tue Jan 30 00:56:37 2001 EDT Modified files: /phpdoc/en/functions pcre.xml Log: Updated against pcre.txt : added some details about performances, nesting and infinite quantifier, and recursive subpatterns.
Index: phpdoc/en/functions/pcre.xml diff -u phpdoc/en/functions/pcre.xml:1.36 phpdoc/en/functions/pcre.xml:1.37 --- phpdoc/en/functions/pcre.xml:1.36 Mon Jan 22 14:51:51 2001 +++ phpdoc/en/functions/pcre.xml Tue Jan 30 00:56:36 2001 @@ -963,6 +963,8 @@ The following sections describe the use of each of the meta-characters. + + BACKSLASH The backslash character has several uses. Firstly, if it is followed by a non-alphameric character, it takes away any @@ -1118,6 +1120,10 @@ \Z and \z is that \Z matches before a newline that is the last character of the string as well as at the end of the string, whereas \z matches only at the end. + + + + CIRCUMFLEX AND DOLLAR Outside a character class, in the default matching mode, the @@ -1223,6 +1229,7 @@ backslash or appear in a position where it cannot be inter- preted as indicating a range, typically as the first or last character in the class. + It is not possible to have the literal character "]" as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ("W" and "-") fol- @@ -1274,7 +1281,6 @@ - INTERNAL OPTION SETTING The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED can be changed from within the pattern by @@ -1536,6 +1542,7 @@ nested capturing subpatterns, the corresponding captured values may have been set in previous iterations. For exam- ple, after + /(a|(b))+/ matches "aba" the value of the second captured substring is @@ -1640,6 +1647,7 @@ whatsoever, because the assertion (?!foo) is always true when the next three characters are "bar". A lookbehind assertion is needed to achieve this effect. + Lookbehind assertions start with (?<= for positive asser- tions and (?<! for negative assertions. For example, @@ -1686,22 +1694,42 @@ (?<=\d{3})(?<!999)foo matches "foo" preceded by three digits that are not "999". - Furthermore, assertions can be nested in any combination. - For example, + Notice that each of the assertions is applied independently + at the same point in the subject string. First there is a + check that the previous three characters are all digits, + then there is a check that the same three characters are not + "999". This pattern does not match "foo" preceded by six + characters, the first of which are digits and the last three + of which are not "999". For example, it doesn't match + "123abcfoo". A pattern to do that is + + (?<=\d{3}...)(?<!999)foo + + This time the first assertion looks at the preceding six + characters, checking that the first three are digits, and + then the second assertion checks that the preceding three + characters are not "999". + + Assertions can be nested in any combination. For example, (?<=(?<!foo)bar)baz matches an occurrence of "baz" that is preceded by "bar" - which in turn is not preceded by "foo". + which in turn is not preceded by "foo", while + (?<=\d{3}(?!999)...)foo + + is another pattern which matches "foo" preceded by three + digits and any three characters that are not "999". + Assertion subpatterns are not capturing subpatterns, and may not be repeated, because it makes no sense to assert the - same thing several times. If an assertion contains capturing - subpatterns within it, these are always counted for the pur- - poses of numbering the capturing subpatterns in the whole - pattern. Substring capturing is carried out for positive - assertions, but it does not make sense for negative asser- - tions. + same thing several times. If any kind of assertion contains + capturing subpatterns within it, these are counted for the + purposes of numbering the capturing subpatterns in the whole + pattern. However, substring capturing is carried out only + for positive assertions, because it does not make sense for + negative assertions. Assertions count towards the maximum of 200 parenthesized subpatterns. @@ -1762,22 +1790,21 @@ abcd$ - when applied to a long string which does not match it. - Because matching proceeds from left to right, PCRE will look - for each "a" in the subject and then see if what follows - matches the rest of the pattern. If the pattern is specified - as + when applied to a long string which does not match. Because + matching proceeds from left to right, PCRE will look for + each "a" in the subject and then see if what follows matches + the rest of the pattern. If the pattern is specified as ^.*abcd$ then the initial .* matches the entire string at first, but - when this fails, it backtracks to match all but the last - character, then all but the last two characters, and so on. - Once again the search for "a" covers the entire string, from - right to left, so we are no better off. However, if the pat- - tern is written as + when this fails (because there is no following "a"), it + backtracks to match all but the last character, then all but + the last two characters, and so on. Once again the search + for "a" covers the entire string, from right to left, so we + are no better off. However, if the pattern is written as - ^(?>.*)(?<=abcd) + ^(?>.*)(?<=abcd) then there can be no backtracking for the .* item; it can match only the entire string. The subsequent lookbehind @@ -1786,7 +1813,35 @@ this approach makes a significant difference to the process- ing time. + When a pattern contains an unlimited repeat inside a subpat- + tern that can itself be repeated an unlimited number of + times, the use of a once-only subpattern is the only way to + avoid some failing matches taking a very long time indeed. + The pattern + + (\D+|<\d+>)*[!?] + + matches an unlimited number of substrings that either con- + sist of non-digits, or digits enclosed in <>, followed by + either ! or ?. When it matches, it runs quickly. However, if + it is applied to + + aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa + + it takes a long time before reporting failure. This is + because the string can be divided between the two repeats in + a large number of ways, and all have to be tried. (The exam- + ple used [!?] rather than a single character at the end, + because both PCRE and Perl have an optimization that allows + for fast failure when a single character is used. They + remember the last single character that is required for a + match, and fail early if it is not present in the string.) + If the pattern is changed to + + ((?>\D+)|<\d+>)*[!?] + sequences of non-digits cannot be broken, and failure hap- + pens quickly. CONDITIONAL SUBPATTERNS It is possible to cause the matching process to obey a sub- @@ -1804,8 +1859,8 @@ error occurs. There are two kinds of condition. If the text between the - parentheses consists of a sequence of digits, then the con- - dition is satisfied if the capturing subpattern of that + parentheses consists of a sequence of digits, then the + condition is satisfied if the capturing subpattern of that number has previously matched. Consider the following pat- tern, which contains non-significant white space to make it more readable (assume the PCRE_EXTENDED option) and to @@ -1833,7 +1888,7 @@ tives on the second line: (?(?=[^a-z]*[a-z]) - \d{2}[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) + \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) The condition is a positive lookahead assertion that matches an optional sequence of non-letters followed by a letter. In @@ -1858,6 +1913,65 @@ +RECURSIVE PATTERNS + Consider the problem of matching a string in parentheses, + allowing for unlimited nested parentheses. Without the use + of recursion, the best that can be done is to use a pattern + that matches up to some fixed depth of nesting. It is not + possible to handle an arbitrary nesting depth. Perl 5.6 has + provided an experimental facility that allows regular + expressions to recurse (amongst other things). The special + item (?R) is provided for the specific case of recursion. + This PCRE pattern solves the parentheses problem (assume + the PCRE_EXTENDED option is set so that white space is + ignored): + + \( ( (?>[^()]+) | (?R) )* \) + + First it matches an opening parenthesis. Then it matches any + number of substrings which can either be a sequence of non- + parentheses, or a recursive match of the pattern itself + (i.e. a correctly parenthesized substring). Finally there is + a closing parenthesis. + + This particular example pattern contains nested unlimited + repeats, and so the use of a once-only subpattern for match- + ing strings of non-parentheses is important when applying + the pattern to strings that do not match. For example, when + it is applied to + + (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() + + it yields "no match" quickly. However, if a once-only sub- + pattern is not used, the match runs for a very long time + indeed because there are so many different ways the + and * + repeats can carve up the subject, and all have to be tested + before failure can be reported. + + The values set for any capturing subpatterns are those from + the outermost level of the recursion at which the subpattern + value is set. If the pattern above is matched against + + (ab(cd)ef) + + the value for the capturing parentheses is "ef", which is + the last value taken on at the top level. If additional + parentheses are added, giving + + \( ( ( (?>[^()]+) | (?R) )* ) \) + ^ ^ + ^ ^ then the string they capture + is "ab(cd)ef", the contents of the top level parentheses. If + there are more than 15 capturing parentheses in a pattern, + PCRE has to obtain extra memory to store data during a + recursion, which it does by using pcre_malloc, freeing it + via pcre_free afterwards. If no memory can be obtained, it + saves data for the first 15 capturing parentheses only, as + there is no way to give an out-of-memory error from within a + recursion. + + + PERFORMANCE Certain items that may appear in patterns are more efficient than others. It is more efficient to use a character class @@ -1876,7 +1990,7 @@ match from the character immediately following one of them instead of from the very start. For example, the pattern - (.*) second + (.*) second matches the subject "first\nand second" (where \n stands for a newline character) with the first captured substring being @@ -1888,6 +2002,40 @@ setting PCRE_DOTALL, or starting the pattern with ^.* to indicate explicit anchoring. That saves PCRE from having to scan along the subject looking for a newline to restart at. + + Beware of patterns that contain nested indefinite repeats. + These can take a long time to run when applied to a string + that does not match. Consider the pattern fragment + + (a+)* + + This can match "aaaa" in 33 different ways, and this number + increases very rapidly as the string gets longer. (The * + repeat can match 0, 1, 2, 3, or 4 times, and for each of + those cases other than 0, the + repeats can match different + numbers of times.) When the remainder of the pattern is such + that the entire match is going to fail, PCRE has in princi- + ple to try every possible variation, and this can take an + extremely long time. + + An optimization catches some of the more simple cases such + as + + (a+)*b + + where a literal character follows. Before embarking on the + standard matching procedure, PCRE checks that there is a "b" + later in the subject string, and if there is not, it fails + the match immediately. However, when there is no following + literal this optimization cannot be used. You can see the + difference by comparing the behaviour of + + (a+)*\d + + with the pattern above. The former gives a failure almost + instantly when applied to a whole line of "a" characters, + whereas the latter takes an appreciable time with strings + longer than about 20 characters. </literallayout> </refsect1> </refentry>