functions pcre.pattern.syntax.xml

Jakub Vrana Tue, 23 Dec 2003 06:10:30 -0800

vrana           Tue Dec 23 08:07:58 2003 EDT


  Modified files:              
    /phpdoc/en/reference/pcre/functions pcre.pattern.syntax.xml 
  Log:
  literallayout changed to para

Index: phpdoc/en/reference/pcre/functions/pcre.pattern.syntax.xml
diff -u phpdoc/en/reference/pcre/functions/pcre.pattern.syntax.xml:1.7 phpdoc/en/reference/pcre/functions/pcre.pattern.syntax.xml:1.8
--- phpdoc/en/reference/pcre/functions/pcre.pattern.syntax.xml:1.7      Fri Dec 19 10:49:44 2003
+++ phpdoc/en/reference/pcre/functions/pcre.pattern.syntax.xml  Tue Dec 23 08:07:58 2003
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="iso-8859-1"?>
-<!-- $Revision: 1.7 $ -->
+<!-- $Revision: 1.8 $ -->
 <!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
   <refentry id="pcre.pattern.syntax">
    <refnamediv>
@@ -159,7 +159,8 @@
      Friedl's  "Mastering  Regular  Expressions",  published   by
      O'Reilly  (ISBN 1-56592-257-3), covers them in great detail.
      The description here is intended as reference documentation.
-
+    </para>
+    <para>
      A regular expression is a pattern that is matched against  a
      subject string from left to right. Most characters stand for
      themselves in a pattern, and match the corresponding
@@ -742,13 +743,14 @@
 
     <refsect2 id="regexp.reference.circudollar">
      <title>Circumflex and dollar</title>
-     <literallayout>
+     <para>
      Outside a character class, in the default matching mode, the
      circumflex  character  is an assertion which is true only if
      the current matching point is at the start  of  the  subject
      string. Inside a character class, circumflex has an entirely
      different meaning (see below).
-
+    </para>
+    <para>
      Circumflex need not be the first character of the pattern if
      a number of alternatives are involved, but it should be the
      first thing in each alternative in which it appears  if  the
@@ -757,7 +759,8 @@
      constrained to match only at the start of the subject, it is
      said to be an "anchored" pattern. (There are also other
      constructs that can cause a pattern to be anchored.)
-
+    </para>
+    <para>
      A dollar character is an assertion which is &true; only if the
      current  matching point is at the end of the subject string,
      or immediately before a newline character that is  the  last
@@ -766,13 +769,15 @@
      are  involved,  but it should be the last item in any branch
      in which it appears.  Dollar has no  special  meaning  in  a
      character class.
-
+    </para>
+    <para>
      The meaning of dollar can be changed so that it matches only
      at   the   very   end   of   the   string,  by  setting  the
      <link linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
      option at compile or matching time. This
      does not affect the \Z assertion.
-
+    </para>
+    <para>
      The meanings of the circumflex  and  dollar  characters  are
     changed  if  the  <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link>  option is set. When this is
      the case,  they  match  immediately  after  and  immediately
@@ -784,17 +789,18 @@
      because all branches start with "^" are not anchored in
     multiline  mode.  The  <link linkend="pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>  option is ignored if
      <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link>  is set.
-
+    </para>
+    <para>
      Note that the sequences \A, \Z, and \z can be used to  match
      the  start  and end of the subject in both modes, and if all
     branches of a pattern start with \A it is  always  anchored,
     whether <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link>  is set or not.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.dot">
      <title>FULL STOP</title>
-     <literallayout>
+     <para>
      Outside a character class, a dot in the pattern matches  any
      one  character  in  the  subject,  including  a non-printing
     character, but not (by default) newline.  If the <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
@@ -803,19 +809,20 @@
      circumflex  and  dollar,  the only relationship being that they
      both involve newline characters.  Dot has no special meaning
      in a character class.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.squarebrackets">
      <title>Square brackets</title>
-     <literallayout>
+     <para>
      An opening square bracket introduces a character class,
      terminated  by  a  closing  square  bracket.  A  closing square
      bracket on its own is  not  special.  If  a  closing  square
      bracket  is  required as a member of the class, it should be
      the first data character in the class (after an initial
      circumflex, if present) or escaped with a backslash.
-
+    </para>
+    <para>
      A character class matches a single character in the subject;
      the  character  must  be in the set of characters defined by
      the class, unless the first character in the class is a
@@ -823,7 +830,8 @@
      the set defined by the class. If a  circumflex  is  actually
      required  as  a  member  of  the class, ensure it is not the
      first character, or escape it with a backslash.
-
+    </para>
+    <para>
      For example, the character class [aeiou] matches  any  lower
      case vowel, while [^aeiou] matches any character that is not
      a lower case vowel. Note that a circumflex is  just  a
@@ -832,18 +840,21 @@
      assertion:  it  still  consumes a character from the subject
      string, and fails if the current pointer is at  the  end  of
      the string.
-
+    </para>
+    <para>
      When caseless matching  is  set,  any  letters  in  a  class
      represent  both their upper case and lower case versions, so
      for example, a caseless [aeiou] matches "A" as well as  "a",
      and  a caseless [^aeiou] does not match "A", whereas a
      caseful version would.
-
+    </para>
+    <para>
      The newline character is never treated in any special way in
     character  classes,  whatever the setting of the <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
     or <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link>  options is. A  class  such  as  [^a]  will
      always match a newline.
-
+    </para>
+    <para>
      The minus (hyphen) character can be used to specify a  range
      of  characters  in  a  character  class.  For example, [d-m]
      matches any letter between d and m, inclusive.  If  a  minus
@@ -851,7 +862,8 @@
      backslash or appear in a position where it cannot be
      interpreted as indicating a range, typically as the first or last
      character in the class.
-     
+    </para>
+    <para>
      It is not possible to have the literal character "]" as  the
      end  character  of  a  range.  A  pattern such as [W-]46] is
      interpreted as a class of two characters ("W" and "-")
@@ -861,7 +873,8 @@
      interpreted as a single class containing a range followed by  two
      separate characters. The octal or hexadecimal representation
      of "]" can also be used to end a range.
-
+    </para>
+    <para>
      Ranges operate in ASCII collating sequence. They can also be
      used  for  characters  specified  numerically,  for  example
      [\000-\037]. If a range that includes letters is  used  when
@@ -870,7 +883,8 @@
      matched  caselessly,  and  if  character tables for the "fr"
      locale are in use, [\xc8-\xcb] matches accented E characters
      in both cases.
-
+    </para>
+    <para>
      The character types \d, \D, \s, \S,  \w,  and  \W  may  also
      appear  in  a  character  class, and add the characters that
      they match to the class. For example, [\dABCDEF] matches any
@@ -879,20 +893,21 @@
      restricted set of characters than the matching lower case type.
      For example, the class [^\W_] matches any letter  or  digit,
      but not underscore.
-
+    </para>
+    <para>
      All non-alphanumeric characters other than \,  -,  ^  (at  the
      start)  and  the  terminating ] are non-special in character
      classes, but it does no harm if they are escaped.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.verticalbar">
      <title>Vertical bar</title>
-     <literallayout>
+     <para>
      Vertical bar characters are  used  to  separate  alternative
      patterns. For example, the pattern
 
-       gilbert|sullivan
+       <literal>gilbert|sullivan</literal>
 
      matches either "gilbert" or "sullivan". Any number of alternatives
      may  appear,  and an empty alternative is permitted
@@ -902,56 +917,82 @@
      subpattern  (defined  below),  "succeeds" means matching the
      rest of the main pattern as well as the alternative  in  the
      subpattern.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.internal-options">
      <title>Internal option setting</title>
-     <literallayout>
-     The settings of <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> , 
-     <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> ,  
-     <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> ,
+     <para>
+     The settings of <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link>, 
+     <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link>,  
+     <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>,
     and  <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>  can be changed from within the pattern by
      a sequence of Perl option letters enclosed between "(?"  and
      ")". The option letters are
 
-       i  for <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> 
-       m  for <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link> 
-       s  for <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> 
-       x  for <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> 
-
+     <table>
+      <title>Internal option letters</title>
+      <tgroup cols="2">
+       <tbody>
+        <row>
+         <entry><literal>i</literal></entry>
+         <entry>for <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link></entry>
+        </row>
+        <row>
+         <entry><literal>m</literal></entry>
+         <entry>for <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link></entry>
+        </row>
+        <row>
+         <entry><literal>s</literal></entry>
+         <entry>for <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link></entry>
+        </row>
+        <row>
+         <entry><literal>x</literal></entry>
+         <entry>for <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link></entry>
+        </row>
+       </tbody>
+      </tgroup>
+     </table>
+    </para>
+    <para>
      For example, (?im) sets caseless, multiline matching. It  is
      also possible to unset these options by preceding the letter
      with a hyphen, and a combined setting and unsetting such  as
     (?im-sx),  which sets <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link>  and <link linkend="pcre.pattern.modifiers">PCRE_MULTILINE</link>  while
-     unsetting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>  and <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link> , is also  permitted.
+     unsetting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>  and <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>, is also  permitted.
      If  a  letter  appears both before and after the hyphen, the
      option is unset.
-
+    </para>
+    <para>
      The scope of these option changes depends on  where  in  the
      pattern  the  setting  occurs. For settings that are outside
      any subpattern (defined below), the effect is the same as if
      the  options were set or unset at the start of matching. The
      following patterns all behave in exactly the same way:
+    </para>
 
+     <literallayout>
        (?i)abc
        a(?i)bc
        ab(?i)c
        abc(?i)
+     </literallayout>
 
+    <para>
      which in turn is the same as compiling the pattern abc  with
      <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link> set.
      In  other words, such "top level" settings apply to the whole
      pattern  (unless  there  are  other changes  inside subpatterns).
      If there is more than one setting of the same option at top level,
      the rightmost  setting is used.
-
+    </para>
+    <para>
      If an option change occurs inside a subpattern,  the  effect
      is  different.  This is a change of behaviour in Perl 5.005.
      An option change inside a subpattern affects only that  part
      of the subpattern that follows it, so
 
-       (a(?i)b)c
+       <literal>(a(?i)b)c</literal>
 
      matches  abc  and  aBc  and  no  other   strings   (assuming
     <link linkend="pcre.pattern.modifiers">PCRE_CASELESS</link>   is  not used).  By this means, options can be
@@ -960,13 +1001,14 @@
      into subsequent branches within  the  same  subpattern.  For
      example,
 
-       (a(?i)b|c)
+       <literal>(a(?i)b|c)</literal>
 
      matches "ab", "aB", "c", and "C", even though when  matching
      "C" the first branch is abandoned before the option setting.
      This is because the effects of  option  settings  happen  at
      compile  time. There would be some very weird behaviour otherwise.
-
+    </para>
+    <para>
     The PCRE-specific options <link linkend="pcre.pattern.modifiers">PCRE_UNGREEDY</link>  and
      <link linkend="pcre.pattern.modifiers">PCRE_EXTRA</link>   can
      be changed in the same way as the Perl-compatible options by
@@ -974,25 +1016,27 @@
      setting  is  special in that it must always occur earlier in
      the pattern than any of the additional features it turns on,
      even when it is at top level. It is best put at the start.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.subpatterns">
      <title>subpatterns</title>
-     <literallayout>
+     <para>
      Subpatterns are delimited by parentheses  (round  brackets),
      which can be nested.  Marking part of a pattern as a subpattern
      does two things:
-
+    </para>
+    <para>
      1. It localizes a set of alternatives. For example, the
      pattern
 
-       cat(aract|erpillar|)
+       <literal>cat(aract|erpillar|)</literal>
 
      matches one of the words "cat",  "cataract",  or  "caterpillar".
      Without  the  parentheses, it would match "cataract",
      "erpillar" or the empty string.
-
+    </para>
+    <para>
      2. It sets up the subpattern as a capturing  subpattern  (as
      defined  above).   When the whole pattern matches, that portion
      of the subject string that matched  the  subpattern  is
@@ -1001,15 +1045,17 @@
      <function>pcre_exec</function>. Opening parentheses are counted
      from  left  to right (starting from 1) to obtain the numbers of the
      capturing subpatterns.
-
+    </para>
+    <para>
      For example, if the string "the red king" is matched against
      the pattern
 
-       the ((red|white) (king|queen))
+       <literal>the ((red|white) (king|queen))</literal>
 
      the captured substrings are "red king", "red",  and  "king",
      and are numbered 1, 2, and 3.
-
+    </para>
+    <para>
      The fact that plain parentheses fulfil two functions is  not
      always  helpful.  There are often times when a grouping subpattern
      is required without a capturing requirement.  If  an
@@ -1019,49 +1065,57 @@
      if the string "the  white  queen"  is  matched  against  the
      pattern
 
-       the ((?:red|white) (king|queen))
+       <literal>the ((?:red|white) (king|queen))</literal>
 
      the captured substrings are "white queen" and  "queen",  and
      are  numbered  1  and 2. The maximum number of captured substrings
      is 99, and the maximum number  of  all  subpatterns,
      both capturing and non-capturing, is 200.
-
+    </para>
+    <para>
      As a  convenient  shorthand,  if  any  option  settings  are
      required  at  the  start  of a non-capturing subpattern, the
      option letters may appear between the "?" and the ":".  Thus
      the two patterns
+    </para>
 
+    <literallayout>
        (?i:saturday|sunday)
        (?:(?i)saturday|sunday)
+    </literallayout>
 
+    <para>
      match exactly the same set of strings.  Because  alternative
      branches  are  tried from left to right, and options are not
      reset until the end of the subpattern is reached, an  option
      setting  in  one  branch does affect subsequent branches, so
      the above patterns match "SUNDAY" as well as "Saturday".
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.repetition">
      <title>Repetition</title>
-     <literallayout>
+     <para>
      Repetition is specified by quantifiers, which can follow any
      of the following items:
 
-       a single character, possibly escaped
-       the . metacharacter
-       a character class
-       a back reference (see next section)
-       a parenthesized subpattern (unless it is  an  assertion  -
-     see below)
-
+      <itemizedlist>
+       <listitem><simpara>a single character, possibly escaped</simpara></listitem>
+       <listitem><simpara>the . metacharacter</simpara></listitem>
+       <listitem><simpara>a character class</simpara></listitem>
+       <listitem><simpara>a back reference (see next section)</simpara></listitem>
+       <listitem><simpara>a parenthesized subpattern (unless it is  an  assertion  -
+     see below)</simpara></listitem>
+      </itemizedlist>
+    </para>
+    <para>
      The general repetition quantifier specifies  a  minimum  and
      maximum  number  of  permitted  matches,  by  giving the two
      numbers in curly brackets (braces), separated  by  a  comma.
      The  numbers  must be less than 65536, and the first must be
      less than or equal to the second. For example:
 
-       z{2,4}
+       <literal>z{2,4}</literal>
 
      matches "zz", "zzz", or "zzzz". A closing brace on  its  own
      is not a special character. If the second number is omitted,
@@ -1069,42 +1123,63 @@
      second number and the comma are both omitted, the quantifier
      specifies an exact number of required matches. Thus
 
-       [aeiou]{3,}
+       <literal>[aeiou]{3,}</literal>
 
      matches at least 3 successive vowels,  but  may  match  many
      more, while
 
-       \d{8}
+       <literal>\d{8}</literal>
 
      matches exactly 8 digits.  An  opening  curly  bracket  that
      appears  in a position where a quantifier is not allowed, or
      one that does not match the syntax of a quantifier, is taken
      as  a literal character. For example, {,6} is not a quantifier,
      but a literal string of four characters.
-
+    </para>
+    <para>
      The quantifier {0} is permitted, causing the  expression  to
      behave  as  if the previous item and the quantifier were not
      present.
-
+    </para>
+    <para>
      For convenience (and  historical  compatibility)  the  three
      most common quantifiers have single-character abbreviations:
 
-       *    is equivalent to {0,}
-       +    is equivalent to {1,}
-       ?    is equivalent to {0,1}
-
+     <table>
+      <title>Single-character quantifiers</title>
+      <tgroup cols="2">
+       <tbody>
+        <row>
+         <entry><literal>*</literal></entry>
+         <entry>equivalent to <literal>{0,}</literal></entry>
+        </row>
+        <row>
+         <entry><literal>+</literal></entry>
+         <entry>equivalent to <literal>{1,}</literal></entry>
+        </row>
+        <row>
+         <entry><literal>?</literal></entry>
+         <entry>equivalent to <literal>{0,1}</literal></entry>
+        </row>
+       </tbody>
+      </tgroup>
+     </table>
+    </para>
+    <para>
      It is possible to construct infinite loops  by  following  a
      subpattern  that  can  match no characters with a quantifier
      that has no upper limit, for example:
 
-       (a?)*
-
+       <literal>(a?)*</literal>
+    </para>
+    <para>
      Earlier versions of Perl and PCRE used to give an  error  at
      compile  time  for such patterns. However, because there are
      cases where this  can  be  useful,  such  patterns  are  now
      accepted,  but  if  any repetition of the subpattern does in
      fact match no characters, the loop is forcibly broken.
-
+    </para>
+    <para>
      By default, the quantifiers  are  "greedy",  that  is,  they
      match  as much as possible (up to the maximum number of permitted
      times), without causing the rest of  the  pattern  to
@@ -1114,20 +1189,21 @@
      * and / characters may appear. An attempt to  match  C  comments
      by applying the pattern
 
-       /\*.*\*/
+       <literal>/\*.*\*/</literal>
 
      to the string
 
-       /* first comment */  not comment  /* second comment */
+       <literal>/* first comment */  not comment  /* second comment */</literal>
 
      fails, because it matches  the  entire  string  due  to  the
      greediness of the .*  item.
-
+    </para>
+    <para>
      However, if a quantifier is followed  by  a  question  mark,
      then it ceases to be greedy, and instead matches the minimum
      number of times possible, so the pattern
 
-       /\*.*?\*/
+       <literal>/\*.*?\*/</literal>
 
      does the right thing with the C comments. The meaning of the
      various  quantifiers is not otherwise changed, just the preferred
@@ -1136,22 +1212,25 @@
      Because it has two uses, it can sometimes appear doubled, as
      in
 
-       \d??\d
+       <literal>\d??\d</literal>
 
      which matches one digit by preference, but can match two  if
      that is the only way the rest of the pattern matches.
-
+    </para>
+    <para>
     If the <link linkend="pcre.pattern.modifiers">PCRE_UNGREEDY</link>  option is set (an option which  is  not
      available  in  Perl)  then the quantifiers are not greedy by
      default, but individual ones can be made greedy by following
      them  with  a  question mark. In other words, it inverts the
      default behaviour.
-
+    </para>
+    <para>
      When a parenthesized subpattern is quantified with a minimum
      repeat  count  that is greater than 1 or with a limited maximum,
      more store is required for the  compiled  pattern,  in
      proportion to the size of the minimum or maximum.
-
+    </para>
+    <para>
     If a pattern starts with .* or  .{0,}  and  the  <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>
      option (equivalent to Perl's /s) is set, thus allowing the .
      to match newlines, then the pattern is implicitly  anchored,
@@ -1163,11 +1242,12 @@
     no newlines, it is worth setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>  when  the  pattern begins with .* in order to
      obtain this optimization, or
      alternatively using ^ to indicate anchoring explicitly.
-
+    </para>
+    <para>
      When a capturing subpattern is repeated, the value  captured
      is the substring that matched the final iteration. For example, after
 
-       (tweedle[dume]{3}\s*)+
+       <literal>(tweedle[dume]{3}\s*)+</literal>
 
      has matched "tweedledum tweedledee" the value  of  the  captured
      substring  is  "tweedledee".  However,  if  there are
@@ -1175,22 +1255,23 @@
      values  may  have been set in previous iterations. For example,
      after
      
-       /(a|(b))+/
+       <literal>/(a|(b))+/</literal>
 
      matches "aba" the value of the second captured substring  is
      "b".
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.back-references">
      <title>BACK REFERENCES</title>
-     <literallayout>
+     <para>
      Outside a character class, a backslash followed by  a  digit
      greater  than  0  (and  possibly  further  digits) is a back
      reference to a capturing subpattern  earlier  (i.e.  to  its
      left)  in  the  pattern,  provided there have been that many
      previous capturing left parentheses.
-
+    </para>
+    <para>
      However, if the decimal number following  the  backslash  is
      less  than  10,  it is always taken as a back reference, and
      causes an error only if there are not  that  many  capturing
@@ -1199,29 +1280,31 @@
      the  reference  for  numbers  less  than 10. See the section
      entitled "Backslash" above for further details of  the  handling
      of digits following a backslash.
-
+    </para>
+    <para>
      A back reference matches whatever actually matched the  capturing
      subpattern in the current subject string, rather than
      anything matching the subpattern itself. So the pattern
 
-       (sens|respons)e and \1ibility
+       <literal>(sens|respons)e and \1ibility</literal>
 
      matches "sense and sensibility" and "response and  responsibility",
      but  not  "sense  and  responsibility". If caseful
      matching is in force at the time of the back reference, then
      the case of letters is relevant. For example,
 
-       ((?i)rah)\s+\1
+       <literal>((?i)rah)\s+\1</literal>
 
      matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even
      though  the  original  capturing subpattern is matched caselessly.
-
+    </para>
+    <para>
      There may be more than one back reference to the  same  subpattern.
      If  a  subpattern  has not actually been used in a
      particular match, then any  back  references  to  it  always
      fail. For example, the pattern
 
-       (a|(bc))\2
+       <literal>(a|(bc))\2</literal>
 
      always fails if it starts to match  "a"  rather  than  "bc".
      Because  there  may  be up to 99 back references, all digits
@@ -1230,13 +1313,14 @@
      character, then some delimiter must be used to terminate the
     back reference. If the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>  option is set, this can
      be whitespace.  Otherwise an empty comment can be used.
-
+    </para>
+    <para>
      A back reference that occurs inside the parentheses to which
      it  refers  fails when the subpattern is first used, so, for
      example, (a\1) never matches.  However, such references  can
      be useful inside repeated subpatterns. For example, the pattern
 
-       (a|b\1)+
+       <literal>(a|b\1)+</literal>
 
      matches any number of "a"s and also "aba", "ababaa" etc.  At
      each iteration of the subpattern, the back reference matches
@@ -1245,12 +1329,12 @@
      that the first iteration does not need  to  match  the  back
      reference.  This  can  be  done using alternation, as in the
      example above, or by a quantifier with a minimum of zero.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.assertions">
      <title>Assertions</title>
-     <literallayout>
+     <para>
      An assertion is  a  test  on  the  characters  following  or
      preceding  the current matching point that does not actually
      consume any characters. The simple assertions coded  as  \b,
@@ -1258,34 +1342,36 @@
      assertions are coded as  subpatterns.  There  are  two
      kinds:  those that look ahead of the current position in the
      subject string, and those that look behind it.
-
+    </para>
+    <para>
      An assertion subpattern is matched in the normal way, except
      that  it  does not cause the current matching position to be
      changed. Lookahead assertions start with  (?=  for  positive
      assertions and (?! for negative assertions. For example,
 
-       \w+(?=;)
+       <literal>\w+(?=;)</literal>
 
      matches a word followed by a semicolon, but does not include
      the semicolon in the match, and
 
-       foo(?!bar)
+       <literal>foo(?!bar)</literal>
 
      matches any occurrence of "foo"  that  is  not  followed  by
      "bar". Note that the apparently similar pattern
 
-       (?!foo)bar
+       <literal>(?!foo)bar</literal>
 
      does not find an occurrence of "bar"  that  is  preceded  by
      something other than "foo"; it finds any occurrence of "bar"
      whatsoever, because the assertion  (?!foo)  is  always  &true;
      when  the  next  three  characters  are  "bar". A lookbehind
      assertion is needed to achieve this effect.
-     
+    </para>
+    <para>
      Lookbehind assertions start with (?&lt;=  for  positive  assertions
      and (?&lt;! for negative assertions. For example,
 
-       (?&lt;!foo)bar
+       <literal>(?&lt;!foo)bar</literal>
 
      does find an occurrence of "bar" that  is  not  preceded  by
      "foo". The contents of a lookbehind assertion are restricted
@@ -1293,11 +1379,11 @@
      length.  However, if there are several alternatives, they do
      not all have to have the same fixed length. Thus
 
-       (?&lt;=bullock|donkey)
+       <literal>(?&lt;=bullock|donkey)</literal>
 
      is permitted, but
 
-       (?&lt;!dogs?|cats?)
+       <literal>(?&lt;!dogs?|cats?)</literal>
 
      causes an error at compile time. Branches  that  match  different
      length strings are permitted only at the top level of
@@ -1305,13 +1391,13 @@
      Perl  5.005,  which  requires all branches to match the same
      length of string. An assertion such as
 
-       (?&lt;=ab(c|de))
+       <literal>(?&lt;=ab(c|de))</literal>
 
      is not permitted, because its single  top-level  branch  can
      match two different lengths, but it is acceptable if rewritten
      to use two top-level branches:
 
-       (?&lt;=abc|abde)
+       <literal>(?&lt;=abc|abde)</literal>
 
      The implementation of lookbehind  assertions  is,  for  each
      alternative,  to  temporarily move the current position back
@@ -1321,11 +1407,12 @@
      once-only  subpatterns can be particularly useful for matching
      at the ends of strings; an example is given at  the  end
      of the section on once-only subpatterns.
-
+    </para>
+    <para>
      Several assertions (of any sort) may  occur  in  succession.
      For example,
 
-       (?&lt;=\d{3})(?&lt;!999)foo
+       <literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>
 
      matches "foo" preceded by three digits that are  not  "999".
      Notice  that each of the assertions is applied independently
@@ -1337,25 +1424,28 @@
      of  which  are  not  "999".  For  example,  it doesn't match
      "123abcfoo". A pattern to do that is
 
-       (?&lt;=\d{3}...)(?&lt;!999)foo
-
+       <literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>
+    </para>
+    <para>
      This time the first assertion looks  at  the  preceding  six
      characters,  checking  that  the first three are digits, and
      then the second assertion checks that  the  preceding  three
      characters are not "999".
-
+    </para>
+    <para>
      Assertions can be nested in any combination. For example,
 
-       (?&lt;=(?&lt;!foo)bar)baz
+       <literal>(?&lt;=(?&lt;!foo)bar)baz</literal>
 
      matches an occurrence of "baz" that  is  preceded  by  "bar"
      which in turn is not preceded by "foo", while
 
-       (?&lt;=\d{3}(?!999)...)foo
+       <literal>(?&lt;=\d{3}(?!999)...)foo</literal>
 
      is another pattern which matches  "foo"  preceded  by  three
      digits and any three characters that are not "999".
-
+    </para>
+    <para>
      Assertion subpatterns are not capturing subpatterns, and may
      not  be  repeated,  because  it makes no sense to assert the
      same thing several times. If any kind of assertion  contains
@@ -1364,15 +1454,16 @@
      pattern.   However,  substring capturing is carried out only
      for positive assertions, because it does not make sense  for
      negative assertions.
-
+    </para>
+    <para>
      Assertions count towards the maximum  of  200  parenthesized
      subpatterns.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.onlyonce">
      <title>Once-only subpatterns</title>
-     <literallayout>
+     <para>
      With both maximizing and minimizing repetition,  failure  of
      what  follows  normally  causes  the repeated item to be
      re-evaluated to see if a different number of repeats allows the
@@ -1381,12 +1472,14 @@
      to  cause  it fail earlier than it otherwise might, when the
      author of the pattern knows there is no  point  in  carrying
      on.
-
+    </para>
+    <para>
      Consider, for example, the pattern \d+foo  when  applied  to
      the subject line
 
-       123456bar
-
+       <literal>123456bar</literal>
+    </para>
+    <para>
      After matching all 6 digits and then failing to match "foo",
      the normal action of the matcher is to try again with only 5
      digits matching the \d+ item, and then with 4,  and  so  on,
@@ -1397,40 +1490,45 @@
      the  first  time.  The  notation  is another kind of special
      parenthesis, starting with (?&gt; as in this example:
 
-       (?&gt;\d+)bar
-
+       <literal>(?&gt;\d+)bar</literal>
+    </para>
+    <para>
      This kind of parenthesis "locks up" the  part of the pattern
      it  contains once it has matched, and a failure further into
      the pattern is prevented from backtracking  into  it.
      Backtracking  past  it to previous items, however, works as normal.
-
+    </para>
+    <para>
      An alternative description is that a subpattern of this type
      matches  the  string  of  characters that an identical standalone
      pattern would match, if anchored at the current point
      in the subject string.
-
+    </para>
+    <para>
      Once-only subpatterns are not capturing subpatterns.  Simple
      cases  such as the above example can be thought of as a maximizing
      repeat that must  swallow  everything  it  can.  So,
      while both \d+ and \d+? are prepared to adjust the number of
      digits they match in order to make the rest of  the  pattern
      match, (?&gt;\d+) can only match an entire sequence of digits.
-
+    </para>
+    <para>
      This construction can of course contain arbitrarily  complicated
      subpatterns, and it can be nested.
-
+    </para>
+    <para>
      Once-only subpatterns can be used in conjunction with
      look-behind  assertions  to specify efficient matching at the end
      of the subject string. Consider a simple pattern such as
 
-       abcd$
+       <literal>abcd$</literal>
 
      when applied to a long string which does not match.  Because
      matching  proceeds  from  left  to right, PCRE will look for
      each "a" in the subject and then see if what follows matches
      the rest of the pattern. If the pattern is specified as
 
-       ^.*abcd$
+       <literal>^.*abcd$</literal>
 
      then the initial .* matches the entire string at first,  but
      when  this  fails  (because  there  is no following "a"), it
@@ -1439,28 +1537,29 @@
      for "a" covers the entire string, from right to left, so  we
      are no better off. However, if the pattern is written as
 
-       ^(?>.*)(?&lt;=abcd)
+       <literal>^(?>.*)(?&lt;=abcd)</literal>
 
      then there can be no backtracking for the .*  item;  it  can
      match  only  the  entire  string.  The subsequent lookbehind
      assertion does a single test on the last four characters. If
      it  fails,  the  match  fails immediately. For long strings,
      this approach makes a significant difference to the processing time.
-
+    </para>
+    <para>
      When a pattern contains an unlimited repeat inside a subpattern
      that can itself be repeated an unlimited number of
      times, the use of a once-only subpattern is the only way  to
      avoid  some  failing matches taking a very long time indeed.
      The pattern
 
-       (\D+|&lt;\d+>)*[!?]
+       <literal>(\D+|&lt;\d+>)*[!?]</literal>
 
      matches an unlimited number of substrings that  either  consist
      of  non-digits,  or digits enclosed in &lt;>, followed by
      either ! or ?. When it matches, it runs quickly. However, if
      it is applied to
 
-       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+       <literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>
 
      it takes a long  time  before  reporting  failure.  This  is
      because the string can be divided between the two repeats in
@@ -1472,29 +1571,33 @@
      match,  and  fail early if it is not present in the string.)
      If the pattern is changed to
 
-       ((?>\D+)|&lt;\d+>)*[!?]
+       <literal>((?>\D+)|&lt;\d+>)*[!?]</literal>
 
      sequences of non-digits cannot be broken, and  failure  happens quickly.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.conditional">
      <title>Conditional subpatterns</title>
-     <literallayout>
+     <para>
      It is possible to cause the matching process to obey a  subpattern 
      conditionally  or to choose between two alternative
      subpatterns, depending on the result  of  an  assertion,  or
      whether  a previous capturing subpattern matched or not. The
      two possible forms of conditional subpattern are
+    </para>
 
+    <literallayout>
        (?(condition)yes-pattern)
        (?(condition)yes-pattern|no-pattern)
-
+    </literallayout>
+    <para>
      If the condition is satisfied, the yes-pattern is used; otherwise
      the  no-pattern  (if  present) is used. If there are
      more than two alternatives in the subpattern, a compile-time
      error occurs.
-
+    </para>
+    <para>
      There are two kinds of condition. If the  text  between  the
      parentheses  consists  of  a  sequence  of  digits, then the
      condition is satisfied if the capturing subpattern  of  that
@@ -1503,8 +1606,9 @@
      more  readable  (assume  the  <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>   option)  and to
      divide it into three parts for ease of discussion:
 
-       ( \( )?    [^()]+    (?(1) \) )
-
+       <literal>( \( )?    [^()]+    (?(1) \) )</literal>
+    </para>
+    <para>
      The first part matches an optional opening parenthesis,  and
      if  that character is present, sets it as the first captured
      substring. The second part matches one  or  more  characters
@@ -1517,16 +1621,20 @@
      subpattern  matches  nothing.  In  other words, this pattern
      matches a sequence of non-parentheses,  optionally  enclosed
      in parentheses.
-
+    </para>
+    <para>
      If the condition is not a sequence of digits, it must be  an
      assertion.  This  may be a positive or negative lookahead or
      lookbehind assertion. Consider this pattern, again  containing
      non-significant  white space, and with the two alternatives on
      the second line:
+    </para>
 
+    <literallayout>
        (?(?=[^a-z]*[a-z])
        \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
-
+    </literallayout>
+    <para>
      The condition is a positive lookahead assertion that matches
      an optional sequence of non-letters followed by a letter. In
      other words, it tests for  the  presence  of  at  least  one
@@ -1535,26 +1643,27 @@
      matched  against the second. This pattern matches strings in
      one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
      letters and dd are digits.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.comments">
      <title>Comments</title>
-     <literallayout>
+     <para>
      The  sequence  (?#  marks  the  start  of  a  comment  which
      continues   up  to  the  next  closing  parenthesis.  Nested
      parentheses are not permitted. The characters that make up a
      comment play no part in the pattern matching at all.
-
+    </para>
+    <para>
      If the <link linkend="pcre.pattern.modifiers">PCRE_EXTENDED</link>  option is set, an unescaped # character
      outside  a character class introduces a comment that
      continues up to the next newline character in the pattern.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.recursive">
      <title>Recursive patterns</title>
-     <literallayout>
+     <para>
      Consider the problem of matching a  string  in  parentheses,
      allowing  for  unlimited nested parentheses. Without the use
      of recursion, the best that can be done is to use a  pattern
@@ -1568,41 +1677,43 @@
      option is set so that white space is 
      ignored):
 
-       \( ( (?>[^()]+) | (?R) )* \)
-
+       <literal>\( ( (?>[^()]+) | (?R) )* \)</literal>
+    </para>
+    <para>
      First it matches an opening parenthesis. Then it matches any
      number  of substrings which can either be a sequence of
      non-parentheses, or a recursive  match  of  the  pattern  itself
      (i.e. a correctly parenthesized substring). Finally there is
      a closing parenthesis.
-
+    </para>
+    <para>
      This particular example pattern  contains  nested  unlimited
      repeats, and so the use of a once-only subpattern for matching
      strings of non-parentheses is  important  when  applying
      the  pattern to strings that do not match. For example, when
      it is applied to
 
-       (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
+       <literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>
 
      it yields "no match" quickly. However, if a  once-only  subpattern
      is  not  used,  the match runs for a very long time
      indeed because there are so many different ways the + and  *
      repeats  can carve up the subject, and all have to be tested
      before failure can be reported.
-
+    </para>
+    <para>
      The values set for any capturing subpatterns are those  from
      the outermost level of the recursion at which the subpattern
      value is set. If the pattern above is matched against
 
-       (ab(cd)ef)
+       <literal>(ab(cd)ef)</literal>
 
      the value for the capturing parentheses is  "ef",  which  is
      the  last  value  taken  on  at the top level. If additional
      parentheses are added, giving
 
-       \( ( ( (?>[^()]+) | (?R) )* ) \)
-          ^                        ^
-          ^                        ^ then the string they capture
+       <literal>\( <emphasis>(</emphasis> ( (?>[^()]+) | (?R) )* <emphasis>)</emphasis> \)</literal>
+     then the string they capture
      is "ab(cd)ef", the contents of the top level parentheses. If
      there are more than 15 capturing parentheses in  a  pattern,
      PCRE  has  to  obtain  extra  memory  to store data during a
@@ -1611,12 +1722,12 @@
      saves data for the first 15 capturing parentheses  only,  as
      there is no way to give an out-of-memory error from within a
      recursion.
-     </literallayout>
+     </para>
     </refsect2>
 
     <refsect2 id="regexp.reference.performances">
      <title>Performances</title>
-     <literallayout>
+     <para>
      Certain items that may appear in patterns are more efficient
      than  others.  It is more efficient to use a character class
      like [aeiou] than a set of alternatives such as (a|e|i|o|u).
@@ -1624,7 +1735,8 @@
      required behaviour is usually the  most  efficient.  Jeffrey
      Friedl's  book contains a lot of discussion about optimizing
      regular expressions for efficient performance.
-
+    </para>
+    <para>
      When a pattern begins with .* and the <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>  option  is
      set,  the  pattern  is implicitly anchored by PCRE, since it
      can match only at the start of a subject string. However, if
@@ -1634,25 +1746,28 @@
      match from the character immediately following one  of  them
      instead of from the very start. For example, the pattern
 
-       (.*) second
+       <literal>(.*) second</literal>
 
      matches the subject "first\nand second" (where \n stands for
      a newline character) with the first captured substring being
      "and". In order to do this, PCRE  has  to  retry  the  match
      starting after every newline in the subject.
-
+    </para>
+    <para>
      If you are using such a pattern with subject strings that do
      not  contain  newlines,  the best performance is obtained by
-     setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link> , or starting the  pattern  with  ^.*  to
+     setting <link linkend="pcre.pattern.modifiers">PCRE_DOTALL</link>, or starting the  pattern  with  ^.*  to
      indicate  explicit anchoring. That saves PCRE from having to
      scan along the subject looking for a newline to restart at.
-
+    </para>
+    <para>
      Beware of patterns that contain nested  indefinite  repeats.
      These  can  take a long time to run when applied to a string
      that does not match. Consider the pattern fragment
 
-       (a+)*
-
+       <literal>(a+)*</literal>
+    </para>
+    <para>
      This can match "aaaa" in 33 different ways, and this  number
      increases  very  rapidly  as  the string gets longer. (The *
      repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of
@@ -1661,11 +1776,12 @@
      that  the entire match is going to fail, PCRE has in principle
      to try every possible variation, and this  can  take  an
      extremely long time.
-
+    </para>
+    <para>
      An optimization catches some of the more simple  cases  such
      as
 
-       (a+)*b
+       <literal>(a+)*b</literal>
 
      where a literal character follows. Before embarking  on  the
      standard matching procedure, PCRE checks that there is a "b"
@@ -1674,13 +1790,13 @@
      literal this optimization cannot be used. You  can  see  the
      difference by comparing the behaviour of
 
-       (a+)*\d
+       <literal>(a+)*\d</literal>
 
      with the pattern above. The former gives  a  failure  almost
      instantly  when  applied  to a whole line of "a" characters,
      whereas the latter takes an appreciable  time  with  strings
      longer than about 20 characters.
-     </literallayout>
+     </para>
     </refsect2>
    </refsect1>
   </refentry>
