vrana Tue Dec 23 08:07:58 2003 EDT
Modified files:
/phpdoc/en/reference/pcre/functions pcre.pattern.syntax.xml
Log:
literallayout changed to para
Index: phpdoc/en/reference/pcre/functions/pcre.pattern.syntax.xml
diff -u phpdoc/en/reference/pcre/functions/pcre.pattern.syntax.xml:1.7 phpdoc/en/reference/pcre/functions/pcre.pattern.syntax.xml:1.8
--- phpdoc/en/reference/pcre/functions/pcre.pattern.syntax.xml:1.7 Fri Dec 19 10:49:44 2003
+++ phpdoc/en/reference/pcre/functions/pcre.pattern.syntax.xml Tue Dec 23 08:07:58 2003
@@ -1,5 +1,5 @@
<?xml version="1.0" encoding="iso-8859-1"?>
-<!-- $Revision: 1.7 $ -->
+<!-- $Revision: 1.8 $ -->
<!-- split from ./en/functions/pcre.xml, last change in rev 1.2 -->
<refentry id="pcre.pattern.syntax">
<refnamediv>
@@ -159,7 +159,8 @@
Friedl's "Mastering Regular Expressions", published by
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
The description here is intended as reference documentation.
-
+ </para>
+ <para>
A regular expression is a pattern that is matched against a
subject string from left to right. Most characters stand for
themselves in a pattern, and match the corresponding
@@ -742,13 +743,14 @@
<refsect2 id="regexp.reference.circudollar">
<title>Circumflex and dollar</title>
- <literallayout>
+ <para>
Outside a character class, in the default matching mode, the
circumflex character is an assertion which is true only if
the current matching point is at the start of the subject
string. Inside a character class, circumflex has an entirely
different meaning (see below).
-
+ </para>
+ <para>
Circumflex need not be the first character of the pattern if
a number of alternatives are involved, but it should be the
first thing in each alternative in which it appears if the
@@ -757,7 +759,8 @@
constrained to match only at the start of the subject, it is
said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
-
+ </para>
+ <para>
A dollar character is an assertion which is &true; only if the
current matching point is at the end of the subject string,
or immediately before a newline character that is the last
@@ -766,13 +769,15 @@
are involved, but it should be the last item in any branch
in which it appears. Dollar has no special meaning in a
character class.
-
+ </para>
+ <para>
The meaning of dollar can be changed so that it matches only
at the very end of the string, by setting the
<link linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
option at compile or matching time. This
does not affect the \Z assertion.
-
+ </para>
+ <para>
The meanings of the circumflex and dollar characters are
changed if the <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link>
option is set. When this is
the case, they match immediately after and immediately
@@ -784,17 +789,18 @@
because all branches start with "^" are not anchored in
multiline mode. The <link
linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> option is ignored if
<link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> is set.
-
+ </para>
+ <para>
Note that the sequences \A, \Z, and \z can be used to match
the start and end of the subject in both modes, and if all
branches of a pattern start with \A it is always anchored,
whether <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> is set or
not.
- </literallayout>
+ </para>
</refsect2>
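The anchoring rules above can be sketched as follows. This is only an illustration, using Python's `re` module as a stand-in for PCRE; it assumes `re.MULTILINE` behaves like PCRE_MULTILINE for these cases, and the subject string is invented for the example.

```python
import re

subject = "first line\nsecond line\n"

# Default mode: ^ matches only at the very start of the subject,
# and $ matches at the end or just before a final newline.
default_starts = re.findall(r"^\w+", subject)
default_ends = re.findall(r"\w+$", subject)

# Multiline mode: ^ also matches immediately after each internal newline.
multiline_starts = re.findall(r"^\w+", subject, re.MULTILINE)

# \A matches only at the start of the subject, in both modes.
anchored_starts = re.findall(r"\A\w+", subject, re.MULTILINE)
```

Here `default_starts` and `anchored_starts` both contain only "first", while `multiline_starts` also picks up "second".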
<refsect2 id="regexp.reference.dot">
<title>Full stop</title>
- <literallayout>
+ <para>
Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing
character, but not (by default) newline. If the <link
linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
@@ -803,19 +809,20 @@
circumflex and dollar, the only relationship being that they
both involve newline characters. Dot has no special meaning
in a character class.
- </literallayout>
+ </para>
</refsect2>
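A small sketch of the dot behaviour described above, again using Python's `re` module as a stand-in and assuming that `re.DOTALL` corresponds to PCRE_DOTALL:

```python
import re

# By default the dot does not match a newline.
without_dotall = re.match(r"a.b", "a\nb")

# With the dot-all flag set, the dot matches a newline as well.
with_dotall = re.match(r"a.b", "a\nb", re.DOTALL)

# The flag does not change how ^ and $ behave.
dollar_unchanged = re.findall(r"\w+$", "one\ntwo", re.DOTALL)
```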
<refsect2 id="regexp.reference.squarebrackets">
<title>Square brackets</title>
- <literallayout>
+ <para>
An opening square bracket introduces a character class,
terminated by a closing square bracket. A closing square
bracket on its own is not special. If a closing square
bracket is required as a member of the class, it should be
the first data character in the class (after an initial
circumflex, if present) or escaped with a backslash.
-
+ </para>
+ <para>
A character class matches a single character in the subject;
the character must be in the set of characters defined by
the class, unless the first character in the class is a
@@ -823,7 +830,8 @@
the set defined by the class. If a circumflex is actually
required as a member of the class, ensure it is not the
first character, or escape it with a backslash.
-
+ </para>
+ <para>
For example, the character class [aeiou] matches any lower
case vowel, while [^aeiou] matches any character that is not
a lower case vowel. Note that a circumflex is just a
@@ -832,18 +840,21 @@
assertion: it still consumes a character from the subject
string, and fails if the current pointer is at the end of
the string.
-
+ </para>
+ <para>
When caseless matching is set, any letters in a class
represent both their upper case and lower case versions, so
for example, a caseless [aeiou] matches "A" as well as "a",
and a caseless [^aeiou] does not match "A", whereas a
caseful version would.
-
+ </para>
+ <para>
The newline character is never treated in any special way in
character classes, whatever the setting of the <link
linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
or <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> options is. A
class such as [^a] will
always match a newline.
-
+ </para>
+ <para>
The minus (hyphen) character can be used to specify a range
of characters in a character class. For example, [d-m]
matches any letter between d and m, inclusive. If a minus
@@ -851,7 +862,8 @@
backslash or appear in a position where it cannot be
interpreted as indicating a range, typically as the first or last
character in the class.
-
+ </para>
+ <para>
It is not possible to have the literal character "]" as the
end character of a range. A pattern such as [W-]46] is
interpreted as a class of two characters ("W" and "-")
@@ -861,7 +873,8 @@
interpreted as a single class containing a range followed by two
separate characters. The octal or hexadecimal representation
of "]" can also be used to end a range.
-
+ </para>
+ <para>
Ranges operate in ASCII collating sequence. They can also be
used for characters specified numerically, for example
[\000-\037]. If a range that includes letters is used when
@@ -870,7 +883,8 @@
matched caselessly, and if character tables for the "fr"
locale are in use, [\xc8-\xcb] matches accented E characters
in both cases.
-
+ </para>
+ <para>
The character types \d, \D, \s, \S, \w, and \W may also
appear in a character class, and add the characters that
they match to the class. For example, [\dABCDEF] matches any
@@ -879,20 +893,21 @@
restricted set of characters than the matching lower case type.
For example, the class [^\W_] matches any letter or digit,
but not underscore.
-
+ </para>
+ <para>
All non-alphanumeric characters other than \, -, ^ (at the
start) and the terminating ] are non-special in character
classes, but it does no harm if they are escaped.
- </literallayout>
+ </para>
</refsect2>
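The character class rules above can be checked with a short sketch. It uses Python's `re` module as a stand-in for PCRE, on the assumption that both treat these simple classes the same way; the variable names are illustrative only.

```python
import re

# [aeiou] matches a lower case vowel; [^aeiou] matches any other single character.
vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")
caseless_non_vowel = re.compile(r"[^aeiou]", re.IGNORECASE)

matches_lower_a = vowel.match("a") is not None
matches_upper_a = vowel.match("A") is not None
negated_matches_upper_a = non_vowel.match("A") is not None
caseless_negated_matches_upper_a = caseless_non_vowel.match("A") is not None

# A negated class is not an assertion: it must consume a character,
# so it fails when the matching point is at the end of the subject.
negated_at_end = re.search(r"x[^aeiou]", "x")

# Newline gets no special treatment inside a class.
negated_matches_newline = re.match(r"[^a]", "\n") is not None

# [^\W_] matches any letter or digit, but not underscore.
word_no_underscore = re.findall(r"[^\W_]", "a_1-")
```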
<refsect2 id="regexp.reference.verticalbar">
<title>Vertical bar</title>
- <literallayout>
+ <para>
Vertical bar characters are used to separate alternative
patterns. For example, the pattern
- gilbert|sullivan
+ <literal>gilbert|sullivan</literal>
matches either "gilbert" or "sullivan". Any number of alternatives
may appear, and an empty alternative is permitted
@@ -902,56 +917,82 @@
subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the
subpattern.
- </literallayout>
+ </para>
</refsect2>
<refsect2 id="regexp.reference.internal-options">
<title>Internal option setting</title>
- <literallayout>
- The settings of <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> ,
- <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> ,
- <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> ,
+ <para>
+ The settings of <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link>,
+ <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link>,
+ <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>,
and <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> can be changed
from within the pattern by
a sequence of Perl option letters enclosed between "(?" and
")". The option letters are
- i for <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link>
- m for <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link>
- s for <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
- x for <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>
-
+ <table>
+ <title>Internal option letters</title>
+ <tgroup cols="2">
+ <tbody>
+ <row>
+ <entry><literal>i</literal></entry>
+ <entry>for <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link></entry>
+ </row>
+ <row>
+ <entry><literal>m</literal></entry>
+ <entry>for <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link></entry>
+ </row>
+ <row>
+ <entry><literal>s</literal></entry>
+ <entry>for <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link></entry>
+ </row>
+ <row>
+ <entry><literal>x</literal></entry>
+ <entry>for <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </para>
+ <para>
For example, (?im) sets caseless, multiline matching. It is
also possible to unset these options by preceding the letter
with a hyphen, and a combined setting and unsetting such as
(?im-sx), which sets <link
linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> and <link
linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> while
- unsetting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> and <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> , is also permitted.
+ unsetting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> and <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>, is also permitted.
If a letter appears both before and after the hyphen, the
option is unset.
-
+ </para>
+ <para>
The scope of these option changes depends on where in the
pattern the setting occurs. For settings that are outside
any subpattern (defined below), the effect is the same as if
the options were set or unset at the start of matching. The
following patterns all behave in exactly the same way:
+ </para>
+ <literallayout>
(?i)abc
a(?i)bc
ab(?i)c
abc(?i)
+ </literallayout>
+ <para>
which in turn is the same as compiling the pattern abc with
<link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> set.
In other words, such "top level" settings apply to the whole
pattern (unless there are other changes inside subpatterns).
If there is more than one setting of the same option at top level,
the rightmost setting is used.
-
+ </para>
+ <para>
If an option change occurs inside a subpattern, the effect
is different. This is a change of behaviour in Perl 5.005.
An option change inside a subpattern affects only that part
of the subpattern that follows it, so
- (a(?i)b)c
+ <literal>(a(?i)b)c</literal>
matches abc and aBc and no other strings (assuming
<link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> is not used). By
this means, options can be
@@ -960,13 +1001,14 @@
into subsequent branches within the same subpattern. For
example,
- (a(?i)b|c)
+ <literal>(a(?i)b|c)</literal>
matches "ab", "aB", "c", and "C", even though when matching
"C" the first branch is abandoned before the option setting.
This is because the effects of option settings happen at
compile time. There would be some very weird behaviour otherwise.
-
+ </para>
+ <para>
The PCRE-specific options <link
linkend="pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
<link linkend="pcre.pattern.modifiers">PCRE_EXTRA</link> can
be changed in the same way as the Perl-compatible options by
@@ -974,25 +1016,27 @@
setting is special in that it must always occur earlier in
the pattern than any of the additional features it turns on,
even when it is at top level. It is best put at the start.
- </literallayout>
+ </para>
</refsect2>
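A hedged sketch of internal option setting, using Python's `re` module as a stand-in. Recent Python versions reject an unscoped option change in the middle of a pattern, so the scoped form `(?i:...)` is used here as an assumed equivalent of the `(a(?i)b)c` example; this is a limitation of the stand-in, not of PCRE.

```python
import re

# A leading (?i) has the same effect as compiling with the caseless flag.
inline = re.compile(r"(?i)abc")
flagged = re.compile(r"abc", re.IGNORECASE)
inline_is_caseless = bool(inline.flags & re.IGNORECASE)

# Scoped option change: only the "b" is matched caselessly.
scoped = re.compile(r"(a(?i:b))c")
scoped_results = {
    s: scoped.fullmatch(s) is not None
    for s in ("abc", "aBc", "abC", "Abc")
}
```

Only "abc" and "aBc" are accepted by `scoped`, which mirrors the behaviour the text gives for an option change inside a subpattern.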
<refsect2 id="regexp.reference.subpatterns">
<title>Subpatterns</title>
- <literallayout>
+ <para>
Subpatterns are delimited by parentheses (round brackets),
which can be nested. Marking part of a pattern as a subpattern
does two things:
-
+ </para>
+ <para>
1. It localizes a set of alternatives. For example, the
pattern
- cat(aract|erpillar|)
+ <literal>cat(aract|erpillar|)</literal>
matches one of the words "cat", "cataract", or "caterpillar".
Without the parentheses, it would match "cataract",
"erpillar" or the empty string.
-
+ </para>
+ <para>
2. It sets up the subpattern as a capturing subpattern (as
defined above). When the whole pattern matches, that portion
of the subject string that matched the subpattern is
@@ -1001,15 +1045,17 @@
<function>pcre_exec</function>. Opening parentheses are counted
from left to right (starting from 1) to obtain the numbers of the
capturing subpatterns.
-
+ </para>
+ <para>
For example, if the string "the red king" is matched against
the pattern
- the ((red|white) (king|queen))
+ <literal>the ((red|white) (king|queen))</literal>
the captured substrings are "red king", "red", and "king",
and are numbered 1, 2, and 3.
-
+ </para>
+ <para>
The fact that plain parentheses fulfil two functions is not
always helpful. There are often times when a grouping subpattern
is required without a capturing requirement. If an
@@ -1019,49 +1065,57 @@
if the string "the white queen" is matched against the
pattern
- the ((?:red|white) (king|queen))
+ <literal>the ((?:red|white) (king|queen))</literal>
the captured substrings are "white queen" and "queen", and
are numbered 1 and 2. The maximum number of captured substrings
is 99, and the maximum number of all subpatterns,
both capturing and non-capturing, is 200.
-
+ </para>
+ <para>
As a convenient shorthand, if any option settings are
required at the start of a non-capturing subpattern, the
option letters may appear between the "?" and the ":". Thus
the two patterns
+ </para>
+ <literallayout>
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
+ </literallayout>
+ <para>
match exactly the same set of strings. Because alternative
branches are tried from left to right, and options are not
reset until the end of the subpattern is reached, an option
setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
- </literallayout>
+ </para>
</refsect2>
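The capturing and non-capturing examples above can be reproduced with a short sketch. Python's `re` module is used as a stand-in for PCRE; group numbering by opening parenthesis is assumed to work the same way in both.

```python
import re

# Parentheses localize the alternatives: "cat" plus an optional ending.
cat_words = [
    w for w in ("cat", "cataract", "caterpillar", "erpillar")
    if re.fullmatch(r"cat(aract|erpillar|)", w)
]

# Capturing groups are numbered by their opening parenthesis, left to right.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king").groups()

# (?: ... ) groups without capturing, so only two substrings are captured.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
).groups()

# Option letters between "?" and ":" apply to the whole non-capturing group.
caseless_group = re.fullmatch(r"(?i:saturday|sunday)", "SUNDAY") is not None
```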
<refsect2 id="regexp.reference.repetition">
<title>Repetition</title>
- <literallayout>
+ <para>
Repetition is specified by quantifiers, which can follow any
of the following items:
- a single character, possibly escaped
- the . metacharacter
- a character class
- a back reference (see next section)
- a parenthesized subpattern (unless it is an assertion -
- see below)
-
+ <itemizedlist>
+ <listitem><simpara>a single character, possibly escaped</simpara></listitem>
+ <listitem><simpara>the . metacharacter</simpara></listitem>
+ <listitem><simpara>a character class</simpara></listitem>
+ <listitem><simpara>a back reference (see next section)</simpara></listitem>
+ <listitem><simpara>a parenthesized subpattern (unless it is an assertion -
+ see below)</simpara></listitem>
+ </itemizedlist>
+ </para>
+ <para>
The general repetition quantifier specifies a minimum and
maximum number of permitted matches, by giving the two
numbers in curly brackets (braces), separated by a comma.
The numbers must be less than 65536, and the first must be
less than or equal to the second. For example:
- z{2,4}
+ <literal>z{2,4}</literal>
matches "zz", "zzz", or "zzzz". A closing brace on its own
is not a special character. If the second number is omitted,
@@ -1069,42 +1123,63 @@
second number and the comma are both omitted, the quantifier
specifies an exact number of required matches. Thus
- [aeiou]{3,}
+ <literal>[aeiou]{3,}</literal>
matches at least 3 successive vowels, but may match many
more, while
- \d{8}
+ <literal>\d{8}</literal>
matches exactly 8 digits. An opening curly bracket that
appears in a position where a quantifier is not allowed, or
one that does not match the syntax of a quantifier, is taken
as a literal character. For example, {,6} is not a quantifier,
but a literal string of four characters.
-
+ </para>
+ <para>
The quantifier {0} is permitted, causing the expression to
behave as if the previous item and the quantifier were not
present.
-
+ </para>
+ <para>
For convenience (and historical compatibility) the three
most common quantifiers have single-character abbreviations:
- * is equivalent to {0,}
- + is equivalent to {1,}
- ? is equivalent to {0,1}
-
+ <table>
+ <title>Single-character quantifiers</title>
+ <tgroup cols="2">
+ <tbody>
+ <row>
+ <entry><literal>*</literal></entry>
+ <entry>equivalent to <literal>{0,}</literal></entry>
+ </row>
+ <row>
+ <entry><literal>+</literal></entry>
+ <entry>equivalent to <literal>{1,}</literal></entry>
+ </row>
+ <row>
+ <entry><literal>?</literal></entry>
+ <entry>equivalent to <literal>{0,1}</literal></entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </table>
+ </para>
+ <para>
It is possible to construct infinite loops by following a
subpattern that can match no characters with a quantifier
that has no upper limit, for example:
- (a?)*
-
+ <literal>(a?)*</literal>
+ </para>
+ <para>
Earlier versions of Perl and PCRE used to give an error at
compile time for such patterns. However, because there are
cases where this can be useful, such patterns are now
accepted, but if any repetition of the subpattern does in
fact match no characters, the loop is forcibly broken.
-
+ </para>
+ <para>
By default, the quantifiers are "greedy", that is, they
match as much as possible (up to the maximum number of permitted
times), without causing the rest of the pattern to
@@ -1114,20 +1189,21 @@
* and / characters may appear. An attempt to match C comments
by applying the pattern
- /\*.*\*/
+ <literal>/\*.*\*/</literal>
to the string
- /* first comment */ not comment /* second comment */
+ <literal>/* first comment */ not comment /* second comment */</literal>
fails, because it matches the entire string due to the
greediness of the .* item.
-
+ </para>
+ <para>
However, if a quantifier is followed by a question mark,
then it ceases to be greedy, and instead matches the minimum
number of times possible, so the pattern
- /\*.*?\*/
+ <literal>/\*.*?\*/</literal>
does the right thing with the C comments. The meaning of the
various quantifiers is not otherwise changed, just the preferred
@@ -1136,22 +1212,25 @@
Because it has two uses, it can sometimes appear doubled, as
in
- \d??\d
+ <literal>\d??\d</literal>
which matches one digit by preference, but can match two if
that is the only way the rest of the pattern matches.
-
+ </para>
+ <para>
If the <link linkend="pcre.pattern.modifiers">PCRE_UNGREEDY</link> option is
set (an option which is not
available in Perl) then the quantifiers are not greedy by
default, but individual ones can be made greedy by following
them with a question mark. In other words, it inverts the
default behaviour.
-
+ </para>
+ <para>
When a parenthesized subpattern is quantified with a minimum
repeat count that is greater than 1 or with a limited maximum,
more store is required for the compiled pattern, in
proportion to the size of the minimum or maximum.
-
+ </para>
+ <para>
If a pattern starts with .* or .{0,} and the <link
linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
option (equivalent to Perl's /s) is set, thus allowing the .
to match newlines, then the pattern is implicitly anchored,
@@ -1163,11 +1242,12 @@
no newlines, it is worth setting <link
linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> when the pattern begins with .*
in order to
obtain this optimization, or
alternatively using ^ to indicate anchoring explicitly.
-
+ </para>
+ <para>
When a capturing subpattern is repeated, the value captured
is the substring that matched the final iteration. For example, after
- (tweedle[dume]{3}\s*)+
+ <literal>(tweedle[dume]{3}\s*)+</literal>
has matched "tweedledum tweedledee" the value of the captured
substring is "tweedledee". However, if there are
@@ -1175,22 +1255,23 @@
values may have been set in previous iterations. For example,
after
- /(a|(b))+/
+ <literal>/(a|(b))+/</literal>
matches "aba" the value of the second captured substring is
"b".
- </literallayout>
+ </para>
</refsect2>
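The quantifier behaviour described in this section can be sketched as below, using Python's `re` module as a stand-in for PCRE. The subject string is the C comment example from the text with the first comment spelled "first comment"; the variable names are illustrative.

```python
import re

# z{2,4} matches two to four consecutive "z" characters.
bounded = re.fullmatch(r"z{2,4}", "zzz") is not None
too_many = re.fullmatch(r"z{2,4}", "zzzzz") is not None

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/", so the whole subject is a single match.
greedy = re.findall(r"/\*.*\*/", subject)

# Lazy .*? stops at the first "*/", giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two if the rest of the match needs it.
prefers_one = re.match(r"\d??\d", "12").group()
takes_two = re.fullmatch(r"\d??\d", "12").group()

# A repeated capturing group keeps the text from its final iteration.
last_iteration = re.match(
    r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee"
).group(1)
```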
<refsect2 id="regexp.reference.back-references">
<title>Back references</title>
- <literallayout>
+ <para>
Outside a character class, a backslash followed by a digit
greater than 0 (and possibly further digits) is a back
reference to a capturing subpattern earlier (i.e. to its
left) in the pattern, provided there have been that many
previous capturing left parentheses.
-
+ </para>
+ <para>
However, if the decimal number following the backslash is
less than 10, it is always taken as a back reference, and
causes an error only if there are not that many capturing
@@ -1199,29 +1280,31 @@
the reference for numbers less than 10. See the section
entitled "Backslash" above for further details of the handling
of digits following a backslash.
-
+ </para>
+ <para>
A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than
anything matching the subpattern itself. So the pattern
- (sens|respons)e and \1ibility
+ <literal>(sens|respons)e and \1ibility</literal>
matches "sense and sensibility" and "response and responsibility",
but not "sense and responsibility". If caseful
matching is in force at the time of the back reference, then
the case of letters is relevant. For example,
- ((?i)rah)\s+\1
+ <literal>((?i)rah)\s+\1</literal>
matches "rah rah" and "RAH RAH", but not "RAH rah", even
though the original capturing subpattern is matched caselessly.
-
+ </para>
+ <para>
There may be more than one back reference to the same subpattern.
If a subpattern has not actually been used in a
particular match, then any back references to it always
fail. For example, the pattern
- (a|(bc))\2
+ <literal>(a|(bc))\2</literal>
always fails if it starts to match "a" rather than "bc".
Because there may be up to 99 back references, all digits
@@ -1230,13 +1313,14 @@
character, then some delimiter must be used to terminate the
back reference. If the <link
linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option is set, this can
be whitespace. Otherwise an empty comment can be used.
-
+ </para>
+ <para>
A back reference that occurs inside the parentheses to which
it refers fails when the subpattern is first used, so, for
example, (a\1) never matches. However, such references can
be useful inside repeated subpatterns. For example, the pattern
- (a|b\1)+
+ <literal>(a|b\1)+</literal>
matches any number of "a"s and also "aba", "ababaa" etc. At
each iteration of the subpattern, the back reference matches
@@ -1245,12 +1329,12 @@
that the first iteration does not need to match the back
reference. This can be done using alternation, as in the
example above, or by a quantifier with a minimum of zero.
- </literallayout>
+ </para>
</refsect2>
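A brief sketch of back references, using Python's `re` module as a stand-in for PCRE. Only constructs that both engines are assumed to handle alike are shown; a back reference placed inside its own group, as in (a|b\1)+, is left out because Python rejects it at compile time.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

# The back reference must repeat the text that was actually captured.
same_word = [
    s for s in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
    if pattern.fullmatch(s)
]

# A back reference to a group that took no part in the match always fails.
unused_group = re.match(r"(a|(bc))\2", "a")
used_group = re.match(r"(a|(bc))\2", "bcbc")
```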
<refsect2 id="regexp.reference.assertions">
<title>Assertions</title>
- <literallayout>
+ <para>
An assertion is a test on the characters following or
preceding the current matching point that does not actually
consume any characters. The simple assertions coded as \b,
@@ -1258,34 +1342,36 @@
assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in the
subject string, and those that look behind it.
-
+ </para>
+ <para>
An assertion subpattern is matched in the normal way, except
that it does not cause the current matching position to be
changed. Lookahead assertions start with (?= for positive
assertions and (?! for negative assertions. For example,
- \w+(?=;)
+ <literal>\w+(?=;)</literal>
matches a word followed by a semicolon, but does not include
the semicolon in the match, and
- foo(?!bar)
+ <literal>foo(?!bar)</literal>
matches any occurrence of "foo" that is not followed by
"bar". Note that the apparently similar pattern
- (?!foo)bar
+ <literal>(?!foo)bar</literal>
does not find an occurrence of "bar" that is preceded by
something other than "foo"; it finds any occurrence of "bar"
whatsoever, because the assertion (?!foo) is always &true;
when the next three characters are "bar". A lookbehind
assertion is needed to achieve this effect.
-
+ </para>
+ <para>
Lookbehind assertions start with (?&lt;= for positive assertions
and (?&lt;! for negative assertions. For example,
- (?&lt;!foo)bar
+ <literal>(?&lt;!foo)bar</literal>
does find an occurrence of "bar" that is not preceded by
"foo". The contents of a lookbehind assertion are restricted
@@ -1293,11 +1379,11 @@
length. However, if there are several alternatives, they do
not all have to have the same fixed length. Thus
- (?&lt;=bullock|donkey)
+ <literal>(?&lt;=bullock|donkey)</literal>
is permitted, but
- (?&lt;!dogs?|cats?)
+ <literal>(?&lt;!dogs?|cats?)</literal>
causes an error at compile time. Branches that match different
length strings are permitted only at the top level of
@@ -1305,13 +1391,13 @@
Perl 5.005, which requires all branches to match the same
length of string. An assertion such as
- (?&lt;=ab(c|de))
+ <literal>(?&lt;=ab(c|de))</literal>
is not permitted, because its single top-level branch can
match two different lengths, but it is acceptable if rewritten
to use two top-level branches:
- (?&lt;=abc|abde)
+ <literal>(?&lt;=abc|abde)</literal>
The implementation of lookbehind assertions is, for each
alternative, to temporarily move the current position back
@@ -1321,11 +1407,12 @@
once-only subpatterns can be particularly useful for matching
at the ends of strings; an example is given at the end
of the section on once-only subpatterns.
-
+ </para>
+ <para>
Several assertions (of any sort) may occur in succession.
For example,
- (?&lt;=\d{3})(?&lt;!999)foo
+ <literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>
matches "foo" preceded by three digits that are not "999".
Notice that each of the assertions is applied independently
@@ -1337,25 +1424,28 @@
of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is
- (?&lt;=\d{3}...)(?&lt;!999)foo
-
+ <literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>
+ </para>
+ <para>
This time the first assertion looks at the preceding six
characters, checking that the first three are digits, and
then the second assertion checks that the preceding three
characters are not "999".
-
+ </para>
+ <para>
Assertions can be nested in any combination. For example,
- (?&lt;=(?&lt;!foo)bar)baz
+ <literal>(?&lt;=(?&lt;!foo)bar)baz</literal>
matches an occurrence of "baz" that is preceded by "bar"
which in turn is not preceded by "foo", while
- (?&lt;=\d{3}(?!999)...)foo
+ <literal>(?&lt;=\d{3}(?!999)...)foo</literal>
is another pattern which matches "foo" preceded by three
digits and any three characters that are not "999".
-
+ </para>
+ <para>
Assertion subpatterns are not capturing subpatterns, and may
not be repeated, because it makes no sense to assert the
same thing several times. If any kind of assertion contains
@@ -1364,15 +1454,16 @@
pattern. However, substring capturing is carried out only
for positive assertions, because it does not make sense for
negative assertions.
-
+ </para>
+ <para>
Assertions count towards the maximum of 200 parenthesized
subpatterns.
- </literallayout>
+ </para>
</refsect2>
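The lookahead and lookbehind examples above can be tried out with the following sketch. It uses Python's `re` module as a stand-in for PCRE and sticks to lookbehinds of a single fixed length, since Python does not accept alternatives of different lengths inside a lookbehind.

```python
import re

# Positive lookahead: the semicolon is required but not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "value;").group()

# Negative lookahead: "foo" not followed by "bar".
foo_not_bar = re.search(r"foo(?!bar)", "foobar foobaz").start()

# (?!foo)bar finds any "bar", even one preceded by "foo".
lookahead_bar = re.search(r"(?!foo)bar", "foobar") is not None

# A negative lookbehind is what actually excludes a preceding "foo".
lookbehind_bar = re.search(r"(?<!foo)bar", "foobar") is not None

# Successive assertions are applied independently at the same point.
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")
digits_then_any_three = re.compile(r"(?<=\d{3}...)(?<!999)foo")

direct = three_digits_not_999.search("123abcfoo") is not None
six_back = digits_then_any_three.search("123abcfoo") is not None
```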
<refsect2 id="regexp.reference.onlyonce">
<title>Once-only subpatterns</title>
- <literallayout>
+ <para>
With both maximizing and minimizing repetition, failure of
what follows normally causes the repeated item to be
re-evaluated to see if a different number of repeats allows the
@@ -1381,12 +1472,14 @@
to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying
on.
-
+ </para>
+ <para>
Consider, for example, the pattern \d+foo when applied to
the subject line
- 123456bar
-
+ <literal>123456bar</literal>
+ </para>
+ <para>
After matching all 6 digits and then failing to match "foo",
the normal action of the matcher is to try again with only 5
digits matching the \d+ item, and then with 4, and so on,
@@ -1397,40 +1490,45 @@
the first time. The notation is another kind of special
parenthesis, starting with (?> as in this example:
- (?>\d+)bar
-
+ <literal>(?>\d+)bar</literal>
+ </para>
+ <para>
This kind of parenthesis "locks up" the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as normal.
-
+ </para>
+ <para>
An alternative description is that a subpattern of this type
matches the string of characters that an identical standalone
pattern would match, if anchored at the current point
in the subject string.
-
+ </para>
+ <para>
Once-only subpatterns are not capturing subpatterns. Simple
cases such as the above example can be thought of as a maximizing
repeat that must swallow everything it can. So,
while both \d+ and \d+? are prepared to adjust the number of
digits they match in order to make the rest of the pattern
match, (?>\d+) can only match an entire sequence of digits.
-
+ </para>
+ <para>
This construction can of course contain arbitrarily complicated
subpatterns, and it can be nested.
-
+ </para>
+ <para>
Once-only subpatterns can be used in conjunction with
look-behind assertions to specify efficient matching at the end
of the subject string. Consider a simple pattern such as
- abcd$
+ <literal>abcd$</literal>
when applied to a long string which does not match. Because
matching proceeds from left to right, PCRE will look for
each "a" in the subject and then see if what follows matches
the rest of the pattern. If the pattern is specified as
- ^.*abcd$
+ <literal>^.*abcd$</literal>
then the initial .* matches the entire string at first, but
when this fails (because there is no following "a"), it
@@ -1439,28 +1537,29 @@
for "a" covers the entire string, from right to left, so we
are no better off. However, if the pattern is written as
- ^(?>.*)(?&lt;=abcd)
+ <literal>^(?>.*)(?&lt;=abcd)</literal>
then there can be no backtracking for the .* item; it can
match only the entire string. The subsequent lookbehind
assertion does a single test on the last four characters. If
it fails, the match fails immediately. For long strings,
this approach makes a significant difference to the processing time.
-
+ </para>
+ <para>
When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a once-only subpattern is the only way to
avoid some failing matches taking a very long time indeed.
The pattern
- (\D+|&lt;\d+&gt;)*[!?]
+ <literal>(\D+|&lt;\d+&gt;)*[!?]</literal>
matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in &lt;&gt;, followed by
either ! or ?. When it matches, it runs quickly. However, if
it is applied to
- aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+ <literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>
it takes a long time before reporting failure. This is
because the string can be divided between the two repeats in
@@ -1472,29 +1571,33 @@
match, and fail early if it is not present in the string.)
If the pattern is changed to
- ((?>\D+)|&lt;\d+&gt;)*[!?]
+ <literal>((?>\D+)|&lt;\d+&gt;)*[!?]</literal>
sequences of non-digits cannot be broken, and failure happens quickly.
- </literallayout>
+ </para>
</refsect2>
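The "locking up" effect of a once-only subpattern can be illustrated with a rough sketch in Python. Python's `re` module only accepts the (?>...) syntax from version 3.11, so this example emulates it with a lookahead followed by a back reference, a common idiom that is assumed here to behave like (?>\d+); it is a stand-in, not the PCRE construct itself.

```python
import re

# Ordinary repeat: \d+ gives back one digit so that the final \d can match.
backtracking = re.match(r"\d+\d", "1234")

# Emulated once-only group: the lookahead captures the whole digit run,
# \1 consumes it, and nothing can be given back to the final \d.
once_only = re.match(r"(?=(\d+))\1\d", "1234")

# When the rest of the pattern does not need any of the digits,
# the emulated once-only group matches as usual.
once_only_ok = re.match(r"(?=(\d+))\1bar", "123456bar")
```

`backtracking` succeeds because the repeat can shrink, while `once_only` fails immediately, which is the trade-off the text describes.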
<refsect2 id="regexp.reference.conditional">
<title>Conditional subpatterns</title>
- <literallayout>
+ <para>
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative
subpatterns, depending on the result of an assertion, or
whether a previous capturing subpattern matched or not. The
two possible forms of conditional subpattern are
+ </para>
+ <literallayout>
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
-
+ </literallayout>
+ <para>
If the condition is satisfied, the yes-pattern is used; otherwise
the no-pattern (if present) is used. If there are
more than two alternatives in the subpattern, a compile-time
error occurs.
-
+ </para>
+ <para>
There are two kinds of condition. If the text between the
parentheses consists of a sequence of digits, then the
condition is satisfied if the capturing subpattern of that
@@ -1503,8 +1606,9 @@
more readable (assume the <link
linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option) and to
divide it into three parts for ease of discussion:
- ( \( )? [^()]+ (?(1) \) )
-
+ <literal>( \( )? [^()]+ (?(1) \) )</literal>
+ </para>
+ <para>
The first part matches an optional opening parenthesis, and
if that character is present, sets it as the first captured
substring. The second part matches one or more characters
@@ -1517,16 +1621,20 @@
subpattern matches nothing. In other words, this pattern
matches a sequence of non-parentheses, optionally enclosed
in parentheses.
-
+ </para>
+ <para>
If the condition is not a sequence of digits, it must be an
assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives on
the second line:
+ </para>
+ <literallayout>
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
-
+ </literallayout>
+ <para>
The condition is a positive lookahead assertion that matches
an optional sequence of non-letters followed by a letter. In
other words, it tests for the presence of at least one
@@ -1535,26 +1643,27 @@
matched against the second. This pattern matches strings in
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
- </literallayout>
+ </para>
</refsect2>
<refsect2 id="regexp.reference.comments">
<title>Comments</title>
- <literallayout>
+ <para>
The sequence (?# marks the start of a comment which
continues up to the next closing parenthesis. Nested
parentheses are not permitted. The characters that make up a
comment play no part in the pattern matching at all.
-
+ </para>
+ <para>
If the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> option is
set, an unescaped # character
outside a character class introduces a comment that
continues up to the next newline character in the pattern.
- </literallayout>
+ </para>
</refsect2>
<refsect2 id="regexp.reference.recursive">
<title>Recursive patterns</title>
- <literallayout>
+ <para>
Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses. Without the use
of recursion, the best that can be done is to use a pattern
@@ -1568,41 +1677,43 @@
option is set so that white space is
ignored):
- \( ( (?>[^()]+) | (?R) )* \)
-
+ <literal>\( ( (?>[^()]+) | (?R) )* \)</literal>
+ </para>
+ <para>
First it matches an opening parenthesis. Then it matches any
number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself
(i.e. a correctly parenthesized substring). Finally there is
a closing parenthesis.
-
+ </para>
+ <para>
This particular example pattern contains nested unlimited
repeats, and so the use of a once-only subpattern for matching
strings of non-parentheses is important when applying
the pattern to strings that do not match. For example, when
it is applied to
- (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
+ <literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>
it yields "no match" quickly. However, if a once-only subpattern
is not used, the match runs for a very long time
indeed because there are so many different ways the + and *
repeats can carve up the subject, and all have to be tested
before failure can be reported.
-
+ </para>
+ <para>
The values set for any capturing subpatterns are those from
the outermost level of the recursion at which the subpattern
value is set. If the pattern above is matched against
- (ab(cd)ef)
+ <literal>(ab(cd)ef)</literal>
the value for the capturing parentheses is "ef", which is
the last value taken on at the top level. If additional
parentheses are added, giving
- \( ( ( (?>[^()]+) | (?R) )* ) \)
- ^ ^
- ^ ^ then the string they capture
+ <literal>\( <emphasis>(</emphasis> ( (?>[^()]+) | (?R) )* <emphasis>)</emphasis> \)</literal>
+ then the string they capture
is "ab(cd)ef", the contents of the top level parentheses. If
there are more than 15 capturing parentheses in a pattern,
PCRE has to obtain extra memory to store data during a
@@ -1611,12 +1722,12 @@
saves data for the first 15 capturing parentheses only, as
there is no way to give an out-of-memory error from within a
recursion.
- </literallayout>
+ </para>
</refsect2>
<refsect2 id="regexp.reference.performances">
<title>Performances</title>
- <literallayout>
+ <para>
Certain items that may appear in patterns are more efficient
than others. It is more efficient to use a character class
like [aeiou] than a set of alternatives such as (a|e|i|o|u).
@@ -1624,7 +1735,8 @@
required behaviour is usually the most efficient. Jeffrey
Friedl's book contains a lot of discussion about optimizing
regular expressions for efficient performance.
-
+ </para>
+ <para>
When a pattern begins with .* and the <link
linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> option is
set, the pattern is implicitly anchored by PCRE, since it
can match only at the start of a subject string. However, if
@@ -1634,25 +1746,28 @@
match from the character immediately following one of them
instead of from the very start. For example, the pattern
- (.*) second
+ <literal>(.*) second</literal>
matches the subject "first\nand second" (where \n stands for
a newline character) with the first captured substring being
"and". In order to do this, PCRE has to retry the match
starting after every newline in the subject.
-
+ </para>
+ <para>
If you are using such a pattern with subject strings that do
not contain newlines, the best performance is obtained by
- setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> , or starting the pattern with ^.* to
+ setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>, or starting the pattern with ^.* to
indicate explicit anchoring. That saves PCRE from having to
scan along the subject looking for a newline to restart at.
-
+ </para>
+ <para>
Beware of patterns that contain nested indefinite repeats.
These can take a long time to run when applied to a string
that does not match. Consider the pattern fragment
- (a+)*
-
+ <literal>(a+)*</literal>
+ </para>
+ <para>
This can match "aaaa" in 33 different ways, and this number
increases very rapidly as the string gets longer. (The *
repeat can match 0, 1, 2, 3, or 4 times, and for each of
@@ -1661,11 +1776,12 @@
that the entire match is going to fail, PCRE has in principle
to try every possible variation, and this can take an
extremely long time.
-
+ </para>
+ <para>
An optimization catches some of the more simple cases such
as
- (a+)*b
+ <literal>(a+)*b</literal>
where a literal character follows. Before embarking on the
standard matching procedure, PCRE checks that there is a "b"
@@ -1674,13 +1790,13 @@
literal this optimization cannot be used. You can see the
difference by comparing the behaviour of
- (a+)*\d
+ <literal>(a+)*\d</literal>
with the pattern above. The former gives a failure almost
instantly when applied to a whole line of "a" characters,
whereas the latter takes an appreciable time with strings
longer than about 20 characters.
- </literallayout>
+ </para>
</refsect2>
</refsect1>
</refentry>
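
The numbered-condition example quoted in the conditional subpatterns section of this diff can be tried out with a minimal sketch. It is written in Python rather than PHP, on the assumption that Python's re module treats the (?(1)...) construct the same way for this simple case, and it uses re.VERBOSE as a stand-in for PCRE_EXTENDED. The helper name is hypothetical and not part of the manual.

```python
import re

# Pattern copied from the conditional subpatterns text. re.VERBOSE is
# assumed here to play the role of PCRE_EXTENDED (white space ignored).
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(text):
    """Hypothetical helper: True if the whole text matches the pattern."""
    return OPTIONAL_PARENS.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["abc", "(abc)", "(abc", "abc)"]:
        print(sample, is_optionally_parenthesized(sample))
```

The sketch anchors the pattern with fullmatch on purpose. An unanchored search would still find "abc" inside "(abc", because the optional first group can simply match nothing, in which case the condition is not satisfied and no closing parenthesis is required.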