aidan Mon Dec 6 22:29:17 2004 EDT
Modified files:
  /phpdoc/en/reference/pcre pattern.syntax.xml
Log:
  whitespace fixes
http://cvs.php.net/diff.php/phpdoc/en/reference/pcre/pattern.syntax.xml?r1=1.4&r2=1.5&ty=u Index: phpdoc/en/reference/pcre/pattern.syntax.xml diff -u phpdoc/en/reference/pcre/pattern.syntax.xml:1.4 phpdoc/en/reference/pcre/pattern.syntax.xml:1.5 --- phpdoc/en/reference/pcre/pattern.syntax.xml:1.4 Wed Aug 11 16:15:29 2004 +++ phpdoc/en/reference/pcre/pattern.syntax.xml Mon Dec 6 22:29:16 2004 @@ -1,5 +1,5 @@ <?xml version="1.0" encoding="iso-8859-1"?> -<!-- $Revision: 1.4 $ --> +<!-- $Revision: 1.5 $ --> <!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 --> <refentry id="reference.pcre.pattern.syntax"> <refnamediv> @@ -38,109 +38,105 @@ </listitem> <listitem> <simpara> - PCRE does not allow repeat quantifiers on lookahead - assertions. Perl permits them, but they do not mean what you - might think. For example, (?!a){3} does not assert that the - next three characters are not "a". It just asserts that the - next character is not "a" three times. + PCRE does not allow repeat quantifiers on lookahead + assertions. Perl permits them, but they do not mean what you + might think. For example, (?!a){3} does not assert that the + next three characters are not "a". It just asserts that the + next character is not "a" three times. </simpara> </listitem> <listitem> <simpara> - Capturing subpatterns that occur inside negative - lookahead assertions are counted, but their entries in the - offsets vector are never set. Perl sets its numerical - variables from any such patterns that are matched before the - assertion fails to match something (thereby succeeding), but - only if the negative lookahead assertion contains just one - branch. + Capturing subpatterns that occur inside negative + lookahead assertions are counted, but their entries in the + offsets vector are never set. 
Perl sets its numerical + variables from any such patterns that are matched before the + assertion fails to match something (thereby succeeding), but + only if the negative lookahead assertion contains just one + branch. </simpara> </listitem> <listitem> <simpara> - Though binary zero characters are supported in the subject string, - they are not allowed in a pattern string because it is passed as a - normal C string, terminated by zero. The escape sequence "\\x00" can - be used in the pattern to represent a binary zero. + Though binary zero characters are supported in the subject string, + they are not allowed in a pattern string because it is passed as a + normal C string, terminated by zero. The escape sequence "\\x00" can + be used in the pattern to represent a binary zero. </simpara> </listitem> <listitem> <simpara> - The following Perl escape sequences are not supported: - \l, \u, \L, \U, \E, \Q. In fact these are implemented by - Perl's general string-handling and are not part of its - pattern matching engine. + The following Perl escape sequences are not supported: + \l, \u, \L, \U, \E, \Q. In fact these are implemented by + Perl's general string-handling and are not part of its + pattern matching engine. </simpara> </listitem> <listitem> <simpara> - The Perl \G assertion is not supported as it is not - relevant to single pattern matches. + The Perl \G assertion is not supported as it is not + relevant to single pattern matches. </simpara> </listitem> <listitem> <simpara> - Fairly obviously, PCRE does not support the (?{code}) - construction. + Fairly obviously, PCRE does not support the (?{code}) + construction. </simpara> </listitem> <listitem> <simpara> - There are at the time of writing some oddities in Perl - 5.005_02 concerned with the settings of captured strings - when part of a pattern is repeated. 
For example, matching - "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value - "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 - unset. However, if the pattern is changed to - /^(aa(b(b))?)+$/ then $2 (and $3) get set. - In Perl 5.004 $2 is set in both cases, and that is also &true; - of PCRE. If in the future Perl changes to a consistent state - that is different, PCRE may change to follow. + There are at the time of writing some oddities in Perl + 5.005_02 concerned with the settings of captured strings + when part of a pattern is repeated. For example, matching + "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value + "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 + unset. However, if the pattern is changed to + /^(aa(b(b))?)+$/ then $2 (and $3) get set. + In Perl 5.004 $2 is set in both cases, and that is also &true; + of PCRE. If in the future Perl changes to a consistent state + that is different, PCRE may change to follow. </simpara> </listitem> <listitem> <simpara> - Another as yet unresolved discrepancy is that in Perl - 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string - "a", whereas in PCRE it does not. However, in both Perl and - PCRE /^(a)?a/ matched against "a" leaves $1 unset. + Another as yet unresolved discrepancy is that in Perl + 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string + "a", whereas in PCRE it does not. However, in both Perl and + PCRE /^(a)?a/ matched against "a" leaves $1 unset. </simpara> </listitem> <listitem> <para> - PCRE provides some extensions to the Perl regular - expression facilities: + PCRE provides some extensions to the Perl regular + expression facilities: <orderedlist> <listitem> <simpara> - Although lookbehind assertions must match fixed length - strings, each alternative branch of a lookbehind assertion - can match a different length of string. Perl 5.005 requires - them all to have the same length. 
+ Although lookbehind assertions must match fixed length + strings, each alternative branch of a lookbehind assertion + can match a different length of string. Perl 5.005 requires + them all to have the same length. </simpara> </listitem> <listitem> <simpara> - If <link - linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> - is set and <link - linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> is + If <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> + is set and <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> is not set, the $ meta-character matches only at the very end of the string. </simpara> </listitem> <listitem> <simpara> - If <link - linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> is + If <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> is set, a backslash followed by a letter with no special meaning is faulted. </simpara> </listitem> <listitem> <simpara> - If <link - linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> is + If <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> is set, the greediness of the repetition quantifiers is inverted, that is, by default they are not greedy, but if followed by a question mark they are. @@ -155,307 +151,202 @@ <refsect1 id="regexp.reference"> <title>Regular Expression Details</title> - <refsect2 id="regexp.introduction"> - <title>Introduction</title> - <para> - The syntax and semantics of the regular expressions - supported by PCRE are described below. Regular expressions are - also described in the Perl documentation and in a number of - other books, some of which have copious examples. Jeffrey - Friedl's "Mastering Regular Expressions", published by - O'Reilly (ISBN 1-56592-257-3), covers them in great detail. - The description here is intended as reference documentation. 
- </para> - <para> - A regular expression is a pattern that is matched against a - subject string from left to right. Most characters stand for - themselves in a pattern, and match the corresponding - characters in the subject. As a trivial example, the pattern - <literal>The quick brown fox</literal> - matches a portion of a subject string that is identical to - itself. - </para> + <refsect2 id="regexp.introduction"> + <title>Introduction</title> + <para> + The syntax and semantics of the regular expressions + supported by PCRE are described below. Regular expressions are + also described in the Perl documentation and in a number of + other books, some of which have copious examples. Jeffrey + Friedl's "Mastering Regular Expressions", published by + O'Reilly (ISBN 1-56592-257-3), covers them in great detail. + The description here is intended as reference documentation. + </para> + <para> + A regular expression is a pattern that is matched against a + subject string from left to right. Most characters stand for + themselves in a pattern, and match the corresponding + characters in the subject. As a trivial example, the pattern + <literal>The quick brown fox</literal> + matches a portion of a subject string that is identical to + itself. + </para> </refsect2> <refsect2 id="regexp.reference.meta"> <title>Meta-characters</title> <para> - The power of regular expressions comes from the - ability to include alternatives and repetitions in the - pattern. These are encoded in the pattern by the use of - <emphasis>meta-characters</emphasis>, which do not stand for themselves but instead - are interpreted in some special way. - </para> - <para> - There are two different sets of meta-characters: those that - are recognized anywhere in the pattern except within square - brackets, and those that are recognized in square brackets. 
- Outside square brackets, the meta-characters are as follows: + The power of regular expressions comes from the + ability to include alternatives and repetitions in the + pattern. These are encoded in the pattern by the use of + <emphasis>meta-characters</emphasis>, which do not stand for themselves but instead + are interpreted in some special way. + </para> + <para> + There are two different sets of meta-characters: those that + are recognized anywhere in the pattern except within square + brackets, and those that are recognized in square brackets. + Outside square brackets, the meta-characters are as follows: <variablelist> <varlistentry> <term><emphasis>\</emphasis></term> - <listitem> - <simpara> - general escape character with several uses - </simpara> - </listitem> + <listitem><simpara>general escape character with several uses</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>^</emphasis></term> - <listitem> - <simpara> - assert start of subject (or line, in multiline mode) - </simpara> - </listitem> + <listitem><simpara>assert start of subject (or line, in multiline mode)</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>$</emphasis></term> - <listitem> - <simpara> - assert end of subject (or line, in multiline mode) - </simpara> - </listitem> + <listitem><simpara>assert end of subject (or line, in multiline mode)</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>.</emphasis></term> - <listitem> - <simpara> - match any character except newline (by default) - </simpara> - </listitem> + <listitem><simpara>match any character except newline (by default)</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>[</emphasis></term> - <listitem> - <simpara> - start character class definition - </simpara> - </listitem> + <listitem><simpara>start character class definition</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>]</emphasis></term> - <listitem> - <simpara> - end character 
class definition - </simpara> - </listitem> + <listitem><simpara>end character class definition</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>|</emphasis></term> - <listitem> - <simpara> - start of alternative branch - </simpara> - </listitem> + <listitem><simpara>start of alternative branch</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>(</emphasis></term> - <listitem> - <simpara> - start subpattern - </simpara> - </listitem> + <listitem><simpara>start subpattern</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>)</emphasis></term> - <listitem> - <simpara> - end subpattern - </simpara> - </listitem> + <listitem><simpara>end subpattern</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>?</emphasis></term> - <listitem> - <simpara> - extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer - </simpara> - </listitem> + <listitem><simpara>extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>*</emphasis></term> - <listitem> - <simpara> - 0 or more quantifier - </simpara> - </listitem> + <listitem><simpara>0 or more quantifier</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>+</emphasis></term> - <listitem> - <simpara> - 1 or more quantifier - </simpara> - </listitem> + <listitem><simpara>1 or more quantifier</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>{</emphasis></term> - <listitem> - <simpara> - start min/max quantifier - </simpara> - </listitem> + <listitem><simpara>start min/max quantifier</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>}</emphasis></term> - <listitem> - <simpara> - end min/max quantifier - </simpara> - </listitem> + <listitem><simpara>end min/max quantifier</simpara></listitem> </varlistentry> </variablelist> - Part of a pattern that is in square brackets is called a - "character class". 
In a character class the only - meta-characters are: + Part of a pattern that is in square brackets is called a + "character class". In a character class the only + meta-characters are: + <variablelist> <varlistentry> <term><emphasis>\</emphasis></term> - <listitem> - <simpara> - general escape character - </simpara> - </listitem> + <listitem><simpara>general escape character</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>^</emphasis></term> - <listitem> - <simpara> - negate the class, but only if the first character - </simpara> - </listitem> + <listitem><simpara>negate the class, but only if the first character</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>-</emphasis></term> - <listitem> - <simpara> - indicates character range - </simpara> - </listitem> + <listitem><simpara>indicates character range</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>]</emphasis></term> - <listitem> - <simpara> - terminates the character class - </simpara> - </listitem> + <listitem><simpara>terminates the character class</simpara></listitem> </varlistentry> </variablelist> - The following sections describe the use of each of the - meta-characters. - </para> + + The following sections describe the use of each of the + meta-characters. + </para> </refsect2> - <refsect2 id="regexp.reference.backslash"> - <title>backslash</title> + + <refsect2 id="regexp.reference.backslash"> + <title>backslash</title> + <para> + The backslash character has several uses. Firstly, if it is + followed by a non-alphanumeric character, it takes away any + special meaning that character may have. This use of + backslash as an escape character applies both inside and + outside character classes. + </para> + <para> + For example, if you want to match a "*" character, you write + "\*" in the pattern. 
This applies whether or not the + following character would otherwise be interpreted as a + meta-character, so it is always safe to precede a non-alphanumeric + with "\" to specify that it stands for itself. In + particular, if you want to match a backslash, you write "\\". + </para> + <para> + If a pattern is compiled with the + <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option, + whitespace in the pattern (other than in a character class) and + characters between a "#" outside a character class and the next newline + character are ignored. An escaping backslash can be used to include a + whitespace or "#" character as part of the pattern. + </para> + <para> + A second use of backslash provides a way of encoding + non-printing characters in patterns in a visible manner. There + is no restriction on the appearance of non-printing characters, + apart from the binary zero that terminates a pattern, + but when a pattern is being prepared by text editing, it is + usually easier to use one of the following escape sequences + than the binary character it represents: + </para> <para> - The backslash character has several uses. Firstly, if it is - followed by a non-alphanumeric character, it takes away any - special meaning that character may have. This use of - backslash as an escape character applies both inside and - outside character classes. - </para> - <para> - For example, if you want to match a "*" character, you write - "\*" in the pattern. This applies whether or not the - following character would otherwise be interpreted as a - meta-character, so it is always safe to precede a non-alphanumeric - with "\" to specify that it stands for itself. In - particular, if you want to match a backslash, you write "\\". 
- </para> - <para> - If a pattern is compiled with the <link - linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option, - whitespace in the pattern (other than in a character class) and - characters between a "#" outside a character class and the next newline - character are ignored. An escaping backslash can be used to include a - whitespace or "#" character as part of the pattern. - </para> - <para> - A second use of backslash provides a way of encoding - non-printing characters in patterns in a visible manner. There - is no restriction on the appearance of non-printing characters, - apart from the binary zero that terminates a pattern, - but when a pattern is being prepared by text editing, it is - usually easier to use one of the following escape sequences - than the binary character it represents: - </para> - <para> <variablelist> <varlistentry> <term><emphasis>\a</emphasis></term> - <listitem> - <simpara> - alarm, that is, the BEL character (hex 07) - </simpara> - </listitem> + <listitem><simpara>alarm, that is, the BEL character (hex 07)</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\cx</emphasis></term> - <listitem> - <simpara> - "control-x", where x is any character - </simpara> - </listitem> + <listitem><simpara>"control-x", where x is any character</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\e</emphasis></term> - <listitem> - <simpara> - escape (hex 1B) - </simpara> - </listitem> + <listitem><simpara>escape (hex 1B)</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\f</emphasis></term> - <listitem> - <simpara> - formfeed (hex 0C) - </simpara> - </listitem> + <listitem><simpara>formfeed (hex 0C)</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\n</emphasis></term> - <listitem> - <simpara> - newline (hex 0A) - </simpara> - </listitem> + <listitem><simpara>newline (hex 0A)</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\r</emphasis></term> 
- <listitem> - <simpara> - carriage return (hex 0D) - </simpara> - </listitem> + <listitem><simpara>carriage return (hex 0D)</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\t</emphasis></term> - <listitem> - <simpara> - tab (hex 09) - </simpara> - </listitem> + <listitem><simpara>tab (hex 09)</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\xhh</emphasis></term> - <listitem> - <simpara> - character with hex code hh - </simpara> - </listitem> + <listitem><simpara>character with hex code hh</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\ddd</emphasis></term> - <listitem> - <simpara> - character with octal code ddd, or backreference - </simpara> - </listitem> + <listitem><simpara>character with octal code ddd, or backreference</simpara></listitem> </varlistentry> </variablelist> - </para> + </para> <para> The precise effect of "<literal>\cx</literal>" is as follows: if "<literal>x</literal>" is a lower case letter, it is converted @@ -496,83 +387,63 @@ stand for themselves. 
For example: </para> <para> - <variablelist> - <varlistentry> - <term><emphasis>\040</emphasis></term> - <listitem> - <simpara> - is another way of writing a space - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\40</emphasis></term> - <listitem> - <simpara> - is the same, provided there are fewer than 40 - previous capturing subpatterns - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\7</emphasis></term> - <listitem> - <simpara> - is always a back reference - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\11</emphasis></term> - <listitem> - <simpara> - might be a back reference, or another way of - writing a tab - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\011</emphasis></term> - <listitem> - <simpara> - is always a tab - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\0113</emphasis></term> - <listitem> - <simpara> - is a tab followed by the character "3" - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\113</emphasis></term> - <listitem> - <simpara> - is the character with octal code 113 (since there - can be no more than 99 back references) - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\377</emphasis></term> - <listitem> - <simpara> - is a byte consisting entirely of 1 bits - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\81</emphasis></term> - <listitem> - <simpara> - is either a back reference, or a binary zero - followed by the two characters "8" and "1" - </simpara> - </listitem> - </varlistentry> + <variablelist> + <varlistentry> + <term><emphasis>\040</emphasis></term> + <listitem><simpara>is another way of writing a space</simpara></listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\40</emphasis></term> + <listitem> + <simpara> + is the same, provided there are fewer than 40 + previous 
capturing subpatterns + </simpara> + </listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\7</emphasis></term> + <listitem><simpara>is always a back reference</simpara></listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\11</emphasis></term> + <listitem> + <simpara> + might be a back reference, or another way of + writing a tab + </simpara> + </listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\011</emphasis></term> + <listitem><simpara>is always a tab</simpara></listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\0113</emphasis></term> + <listitem><simpara>is a tab followed by the character "3"</simpara></listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\113</emphasis></term> + <listitem> + <simpara> + is the character with octal code 113 (since there + can be no more than 99 back references) + </simpara> + </listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\377</emphasis></term> + <listitem><simpara>is a byte consisting entirely of 1 bits</simpara></listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\81</emphasis></term> + <listitem> + <simpara> + is either a back reference, or a binary zero + followed by the two characters "8" and "1" + </simpara> + </listitem> + </varlistentry> </variablelist> </para> <para> @@ -592,56 +463,32 @@ character types: </para> <para> - <variablelist> - <varlistentry> - <term><emphasis>\d</emphasis></term> - <listitem> - <simpara> - any decimal digit - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\D</emphasis></term> - <listitem> - <simpara> - any character that is not a decimal digit - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\s</emphasis></term> - <listitem> - <simpara> - any whitespace character - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\S</emphasis></term> - <listitem> - <simpara> - any character that is not a whitespace character - 
</simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\w</emphasis></term> - <listitem> - <simpara> - any "word" character - </simpara> - </listitem> - </varlistentry> - <varlistentry> - <term><emphasis>\W</emphasis></term> - <listitem> - <simpara> - any "non-word" character - </simpara> - </listitem> - </varlistentry> - </variablelist> + <variablelist> + <varlistentry> + <term><emphasis>\d</emphasis></term> + <listitem><simpara>any decimal digit</simpara></listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\D</emphasis></term> + <listitem><simpara>any character that is not a decimal digit</simpara></listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\s</emphasis></term> + <listitem><simpara>any whitespace character</simpara></listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\S</emphasis></term> + <listitem><simpara>any character that is not a whitespace character</simpara></listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\w</emphasis></term> + <listitem><simpara>any "word" character</simpara></listitem> + </varlistentry> + <varlistentry> + <term><emphasis>\W</emphasis></term> + <listitem><simpara>any "non-word" character</simpara></listitem> + </varlistentry> + </variablelist> </para> <para> Each pair of escape sequences partitions the complete set of @@ -677,44 +524,28 @@ <variablelist> <varlistentry> <term><emphasis>\b</emphasis></term> - <listitem> - <simpara> - word boundary - </simpara> - </listitem> + <listitem><simpara>word boundary</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\B</emphasis></term> - <listitem> - <simpara> - not a word boundary - </simpara> - </listitem> + <listitem><simpara>not a word boundary</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\A</emphasis></term> - <listitem> - <simpara> - start of subject (independent of multiline mode) - </simpara> - </listitem> + <listitem><simpara>start of subject (independent of 
multiline mode)</simpara></listitem> </varlistentry> <varlistentry> <term><emphasis>\Z</emphasis></term> - <listitem> + <listitem> <simpara> - end of subject or newline at end (independent of - multiline mode) + end of subject or newline at end (independent of + multiline mode) </simpara> </listitem> </varlistentry> <varlistentry> <term><emphasis>\z</emphasis></term> - <listitem> - <simpara> - end of subject(independent of multiline mode) - </simpara> - </listitem> + <listitem><simpara>end of subject(independent of multiline mode)</simpara></listitem> </varlistentry> </variablelist> </para> @@ -738,8 +569,7 @@ ever match at the very start and end of the subject string, whatever options are set. They are not affected by the <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or - <link - linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> + <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> options. The difference between <literal>\Z</literal> and <literal>\z</literal> is that <literal>\Z</literal> matches before a newline that is the last character of the string as well as at the end of @@ -750,60 +580,59 @@ <refsect2 id="regexp.reference.circudollar"> <title>Circumflex and dollar</title> <para> - Outside a character class, in the default matching mode, the - circumflex character is an assertion which is true only if - the current matching point is at the start of the subject - string. Inside a character class, circumflex has an entirely - different meaning (see below). - </para> - <para> - Circumflex need not be the first character of the pattern if - a number of alternatives are involved, but it should be the - first thing in each alternative in which it appears if the - pattern is ever to match that branch. If all possible - alternatives start with a circumflex, that is, if the pattern is - constrained to match only at the start of the subject, it is - said to be an "anchored" pattern. 
(There are also other - constructs that can cause a pattern to be anchored.) - </para> - <para> - A dollar character is an assertion which is &true; only if the - current matching point is at the end of the subject string, - or immediately before a newline character that is the last - character in the string (by default). Dollar need not be the - last character of the pattern if a number of alternatives - are involved, but it should be the last item in any branch - in which it appears. Dollar has no special meaning in a - character class. - </para> - <para> - The meaning of dollar can be changed so that it matches only - at the very end of the string, by setting the - <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> - option at compile or matching time. This - does not affect the \Z assertion. - </para> - <para> - The meanings of the circumflex and dollar characters are - changed if the <link - linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> option - is set. When this is the case, they match immediately after and - immediately before an internal "\n" character, respectively, in addition - to matching at the start and end of the subject string. For example, the - pattern /^abc$/ matches the subject string "def\nabc" in multiline mode, - but not otherwise. Consequently, patterns that are anchored in single - line mode because all branches start with "^" are not anchored in - multiline mode. The <link - linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> - option is ignored if <link - linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> is - set. - </para> - <para> - Note that the sequences \A, \Z, and \z can be used to match - the start and end of the subject in both modes, and if all - branches of a pattern start with \A is it always anchored, - whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> is set or not. 
+ Outside a character class, in the default matching mode, the + circumflex character is an assertion which is true only if + the current matching point is at the start of the subject + string. Inside a character class, circumflex has an entirely + different meaning (see below). + </para> + <para> + Circumflex need not be the first character of the pattern if + a number of alternatives are involved, but it should be the + first thing in each alternative in which it appears if the + pattern is ever to match that branch. If all possible + alternatives start with a circumflex, that is, if the pattern is + constrained to match only at the start of the subject, it is + said to be an "anchored" pattern. (There are also other + constructs that can cause a pattern to be anchored.) + </para> + <para> + A dollar character is an assertion which is &true; only if the + current matching point is at the end of the subject string, + or immediately before a newline character that is the last + character in the string (by default). Dollar need not be the + last character of the pattern if a number of alternatives + are involved, but it should be the last item in any branch + in which it appears. Dollar has no special meaning in a + character class. + </para> + <para> + The meaning of dollar can be changed so that it matches only + at the very end of the string, by setting the + <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> + option at compile or matching time. This does not affect the \Z assertion. + </para> + <para> + The meanings of the circumflex and dollar characters are + changed if the + <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> option + is set. When this is the case, they match immediately after and + immediately before an internal "\n" character, respectively, in addition + to matching at the start and end of the subject string. 
For example, the + pattern /^abc$/ matches the subject string "def\nabc" in multiline mode, + but not otherwise. Consequently, patterns that are anchored in single + line mode because all branches start with "^" are not anchored in + multiline mode. The + <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link> + option is ignored if + <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> is + set. + </para> + <para> + Note that the sequences \A, \Z, and \z can be used to match + the start and end of the subject in both modes, and if all + branches of a pattern start with \A is it always anchored, + whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> is set or not. </para> </refsect2> @@ -812,8 +641,8 @@ <para> Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing - character, but not (by default) newline. If the <link - linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> + character, but not (by default) newline. If the + <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is set, then dots match newlines as well. The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they @@ -825,90 +654,90 @@ <refsect2 id="regexp.reference.squarebrackets"> <title>Square brackets</title> <para> - An opening square bracket introduces a character class, - terminated by a closing square bracket. A closing square - bracket on its own is not special. If a closing square - bracket is required as a member of the class, it should be - the first data character in the class (after an initial - circumflex, if present) or escaped with a backslash. 
- </para> - <para> - A character class matches a single character in the subject; - the character must be in the set of characters defined by - the class, unless the first character in the class is a - circumflex, in which case the subject character must not be in - the set defined by the class. If a circumflex is actually - required as a member of the class, ensure it is not the - first character, or escape it with a backslash. - </para> - <para> - For example, the character class [aeiou] matches any lower - case vowel, while [^aeiou] matches any character that is not - a lower case vowel. Note that a circumflex is just a - convenient notation for specifying the characters which are in - the class by enumerating those that are not. It is not an - assertion: it still consumes a character from the subject - string, and fails if the current pointer is at the end of - the string. - </para> - <para> - When caseless matching is set, any letters in a class - represent both their upper case and lower case versions, so - for example, a caseless [aeiou] matches "A" as well as "a", - and a caseless [^aeiou] does not match "A", whereas a - caseful version would. - </para> - <para> - The newline character is never treated in any special way in - character classes, whatever the setting of the <link - linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> - or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> - options is. A class such as [^a] will always match a newline. - </para> - <para> - The minus (hyphen) character can be used to specify a range - of characters in a character class. For example, [d-m] - matches any letter between d and m, inclusive. If a minus - character is required in a class, it must be escaped with a - backslash or appear in a position where it cannot be - interpreted as indicating a range, typically as the first or last - character in the class. 
- </para> - <para> - It is not possible to have the literal character "]" as the - end character of a range. A pattern such as [W-]46] is - interpreted as a class of two characters ("W" and "-") - followed by a literal string "46]", so it would match "W46]" or - "-46]". However, if the "]" is escaped with a backslash it - is interpreted as the end of range, so [W-\]46] is - interpreted as a single class containing a range followed by two - separate characters. The octal or hexadecimal representation - of "]" can also be used to end a range. - </para> - <para> - Ranges operate in ASCII collating sequence. They can also be - used for characters specified numerically, for example - [\000-\037]. If a range that includes letters is used when - caseless matching is set, it matches the letters in either - case. For example, [W-c] is equivalent to [][\^_`wxyzabc], - matched caselessly, and if character tables for the "fr" - locale are in use, [\xc8-\xcb] matches accented E characters - in both cases. - </para> - <para> - The character types \d, \D, \s, \S, \w, and \W may also - appear in a character class, and add the characters that - they match to the class. For example, [\dABCDEF] matches any - hexadecimal digit. A circumflex can conveniently be used - with the upper case character types to specify a more - restricted set of characters than the matching lower case type. - For example, the class [^\W_] matches any letter or digit, - but not underscore. - </para> - <para> - All non-alphanumeric characters other than \, -, ^ (at the - start) and the terminating ] are non-special in character - classes, but it does no harm if they are escaped. + An opening square bracket introduces a character class, + terminated by a closing square bracket. A closing square + bracket on its own is not special. 
If a closing square + bracket is required as a member of the class, it should be + the first data character in the class (after an initial + circumflex, if present) or escaped with a backslash. + </para> + <para> + A character class matches a single character in the subject; + the character must be in the set of characters defined by + the class, unless the first character in the class is a + circumflex, in which case the subject character must not be in + the set defined by the class. If a circumflex is actually + required as a member of the class, ensure it is not the + first character, or escape it with a backslash. + </para> + <para> + For example, the character class [aeiou] matches any lower + case vowel, while [^aeiou] matches any character that is not + a lower case vowel. Note that a circumflex is just a + convenient notation for specifying the characters which are in + the class by enumerating those that are not. It is not an + assertion: it still consumes a character from the subject + string, and fails if the current pointer is at the end of + the string. + </para> + <para> + When caseless matching is set, any letters in a class + represent both their upper case and lower case versions, so + for example, a caseless [aeiou] matches "A" as well as "a", + and a caseless [^aeiou] does not match "A", whereas a + caseful version would. + </para> + <para> + The newline character is never treated in any special way in + character classes, whatever the setting of the <link + linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> + or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> + options is. A class such as [^a] will always match a newline. + </para> + <para> + The minus (hyphen) character can be used to specify a range + of characters in a character class. For example, [d-m] + matches any letter between d and m, inclusive. 
If a minus + character is required in a class, it must be escaped with a + backslash or appear in a position where it cannot be + interpreted as indicating a range, typically as the first or last + character in the class. + </para> + <para> + It is not possible to have the literal character "]" as the + end character of a range. A pattern such as [W-]46] is + interpreted as a class of two characters ("W" and "-") + followed by a literal string "46]", so it would match "W46]" or + "-46]". However, if the "]" is escaped with a backslash it + is interpreted as the end of range, so [W-\]46] is + interpreted as a single class containing a range followed by two + separate characters. The octal or hexadecimal representation + of "]" can also be used to end a range. + </para> + <para> + Ranges operate in ASCII collating sequence. They can also be + used for characters specified numerically, for example + [\000-\037]. If a range that includes letters is used when + caseless matching is set, it matches the letters in either + case. For example, [W-c] is equivalent to [][\^_`wxyzabc], + matched caselessly, and if character tables for the "fr" + locale are in use, [\xc8-\xcb] matches accented E characters + in both cases. + </para> + <para> + The character types \d, \D, \s, \S, \w, and \W may also + appear in a character class, and add the characters that + they match to the class. For example, [\dABCDEF] matches any + hexadecimal digit. A circumflex can conveniently be used + with the upper case character types to specify a more + restricted set of characters than the matching lower case type. + For example, the class [^\W_] matches any letter or digit, + but not underscore. + </para> + <para> + All non-alphanumeric characters other than \, -, ^ (at the + start) and the terminating ] are non-special in character + classes, but it does no harm if they are escaped. </para> </refsect2> @@ -917,9 +746,7 @@ <para> Vertical bar characters are used to separate alternative patterns. 
For example, the pattern - - <literal>gilbert|sullivan</literal> - + <literal>gilbert|sullivan</literal> matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries @@ -934,104 +761,105 @@ <refsect2 id="regexp.reference.internal-options"> <title>Internal option setting</title> <para> - The settings of <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link>, - <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>, - <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>, - <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>, - and <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> can be changed from within the pattern by - a sequence of Perl option letters enclosed between "(?" and - ")". The option letters are + The settings of <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link>, + <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>, + <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>, + <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>, + and <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> + can be changed from within the pattern by + a sequence of Perl option letters enclosed between "(?" and + ")". 
The option letters are: + + <table> + <title>Internal option letters</title> + <tgroup cols="2"> + <tbody> + <row> + <entry><literal>i</literal></entry> + <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link></entry> + </row> + <row> + <entry><literal>m</literal></entry> + <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link></entry> + </row> + <row> + <entry><literal>s</literal></entry> + <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link></entry> + </row> + <row> + <entry><literal>x</literal></entry> + <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link></entry> + </row> + <row> + <entry><literal>U</literal></entry> + <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link></entry> + </row> + </tbody> + </tgroup> + </table> + </para> + <para> + For example, (?im) sets caseless, multiline matching. It is + also possible to unset these options by preceding the letter + with a hyphen, and a combined setting and unsetting such as + (?im-sx), which sets <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> while + unsetting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>, is also permitted. + If a letter appears both before and after the hyphen, the + option is unset. + </para> + <para> + The scope of these option changes depends on where in the + pattern the setting occurs. For settings that are outside + any subpattern (defined below), the effect is the same as if + the options were set or unset at the start of matching. 
The + following patterns all behave in exactly the same way: + </para> - <table> - <title>Internal option letters</title> - <tgroup cols="2"> - <tbody> - <row> - <entry><literal>i</literal></entry> - <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link></entry> - </row> - <row> - <entry><literal>m</literal></entry> - <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link></entry> - </row> - <row> - <entry><literal>s</literal></entry> - <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link></entry> - </row> - <row> - <entry><literal>x</literal></entry> - <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link></entry> - </row> - <row> - <entry><literal>U</literal></entry> - <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link></entry> - </row> - </tbody> - </tgroup> - </table> - </para> - <para> - For example, (?im) sets caseless, multiline matching. It is - also possible to unset these options by preceding the letter - with a hyphen, and a combined setting and unsetting such as - (?im-sx), which sets <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> while - unsetting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>, is also permitted. - If a letter appears both before and after the hyphen, the - option is unset. - </para> - <para> - The scope of these option changes depends on where in the - pattern the setting occurs. For settings that are outside - any subpattern (defined below), the effect is the same as if - the options were set or unset at the start of matching. 
The - following patterns all behave in exactly the same way: - </para> - - <literallayout> - (?i)abc - a(?i)bc - ab(?i)c - abc(?i) - </literallayout> - - <para> - which in turn is the same as compiling the pattern abc with - <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> set. - In other words, such "top level" settings apply to the whole - pattern (unless there are other changes inside subpatterns). - If there is more than one setting of the same option at top level, - the rightmost setting is used. - </para> - <para> - If an option change occurs inside a subpattern, the effect - is different. This is a change of behaviour in Perl 5.005. - An option change inside a subpattern affects only that part - of the subpattern that follows it, so - - <literal>(a(?i)b)c</literal> - - matches abc and aBc and no other strings (assuming - <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not used). By this means, options can be - made to have different settings in different parts of the - pattern. Any changes made in one alternative do carry on - into subsequent branches within the same subpattern. For - example, - - <literal>(a(?i)b|c)</literal> - - matches "ab", "aB", "c", and "C", even though when matching - "C" the first branch is abandoned before the option setting. - This is because the effects of option settings happen at - compile time. There would be some very weird behaviour otherwise. - </para> - <para> - The PCRE-specific options <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and - <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can - be changed in the same way as the Perl-compatible options by - using the characters U and X respectively. The (?X) flag - setting is special in that it must always occur earlier in - the pattern than any of the additional features it turns on, - even when it is at top level. It is best put at the start. 
+ <literallayout> + (?i)abc + a(?i)bc + ab(?i)c + abc(?i) + </literallayout> + + <para> + which in turn is the same as compiling the pattern abc with + <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> set. + In other words, such "top level" settings apply to the whole + pattern (unless there are other changes inside subpatterns). + If there is more than one setting of the same option at top level, + the rightmost setting is used. + </para> + <para> + If an option change occurs inside a subpattern, the effect + is different. This is a change of behaviour in Perl 5.005. + An option change inside a subpattern affects only that part + of the subpattern that follows it, so + + <literal>(a(?i)b)c</literal> + + matches abc and aBc and no other strings (assuming + <link linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not used). By this means, options can be + made to have different settings in different parts of the + pattern. Any changes made in one alternative do carry on + into subsequent branches within the same subpattern. For + example, + + <literal>(a(?i)b|c)</literal> + + matches "ab", "aB", "c", and "C", even though when matching + "C" the first branch is abandoned before the option setting. + This is because the effects of option settings happen at + compile time. There would be some very weird behaviour otherwise. + </para> + <para> + The PCRE-specific options <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and + <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can + be changed in the same way as the Perl-compatible options by + using the characters U and X respectively. The (?X) flag + setting is special in that it must always occur earlier in + the pattern than any of the additional features it turns on, + even when it is at top level. It is best put at the start. </para> </refsect2>
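The scoping of option changes inside a subpattern, described in the "Internal option setting" section above, can be checked with a short script. This is a minimal sketch, not part of the committed file: the helper name matches() and the ^...$ anchors are assumptions added so that each subject has to match in full, and the results noted in the comments are the ones the section's text predicts.

```php
<?php
// Hypothetical helper (not from the manual): true when $subject matches $pattern.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// (?i) inside the group affects only the rest of that group,
// so "b" is caseless while "a" and the trailing "c" stay caseful.
$scoped = '/^(a(?i)b)c$/';

// An option set in one alternative carries on into later branches
// of the same subpattern, so "c" is matched caselessly as well.
$branches = '/^(a(?i)b|c)$/';

$cases = [
    [$scoped, 'abc'],   // text above: matches
    [$scoped, 'aBc'],   // text above: matches
    [$scoped, 'abC'],   // "c" is outside the group: should not match
    [$branches, 'aB'],  // text above: matches
    [$branches, 'C'],   // text above: matches
    [$branches, 'Ab'],  // "a" precedes the option change: should not match
];

foreach ($cases as [$pattern, $subject]) {
    printf("%-16s %-4s %s\n", $pattern, $subject,
        matches($pattern, $subject) ? 'match' : 'no match');
}
```

The anchors are at top level and do not change how the option applies inside the parentheses; without them preg_match() would also accept subjects that merely contain a matching substring.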