Author: larry
Date: Wed Feb 20 10:49:59 2008
New Revision: 14513
Modified: doc/trunk/design/syn/S05.pod
Log:
Clarification of ** semantics under :sigspace and :ratchet
Allow quantification on separator atom for common \s+ case
Clarify that the <file> examples are ignoring whitespace issues

Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod (original)
+++ doc/trunk/design/syn/S05.pod Wed Feb 20 10:49:59 2008
@@ -14,9 +14,9 @@
  Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
              Larry Wall <[EMAIL PROTECTED]>
  Date: 24 Jun 2002
- Last Modified: 30 Jan 2008
+ Last Modified: 20 Feb 2008
  Number: 5
- Version: 72
+ Version: 73
 
 This document summarizes Apocalypse 5, which is about the new regex
 syntax. We now try to call them I<regex> rather than "regular
@@ -336,10 +336,10 @@
 
 New modifiers specify Unicode level:
 
-    m:bytes / .**{2} /       # match two bytes
-    m:codes / .**{2} /       # match two codepoints
-    m:graphs / .**{2} /      # match two language-independent graphemes
-    m:chars / .**{2} /       # match two characters at current max level
+    m:bytes / .**2 /       # match two bytes
+    m:codes / .**2 /       # match two codepoints
+    m:graphs / .**2 /      # match two language-independent graphemes
+    m:chars / .**2 /       # match two characters at current max level
 
 There are corresponding pragmas to default to these levels. Note that
 the C<:chars> modifier is always redundant because dot always matches
@@ -361,7 +361,7 @@
 
 is equivalant to the Perl 6 syntax:
 
-    m/ :i ^^ [ <[a..z]> || \d ]**{1..2} <before \s> /
+    m/ :i ^^ [ <[a..z]> || \d ] ** 1..2 <before \s> /
 
 =item *
 
@@ -733,8 +733,11 @@
 The general repetition specifier is now C<**> for maximal matching,
 with a corresponding C<**?> for minimal matching. (All such quantifier
 modifiers now go directly after the C<**>.) Space is allowed on either
-side of the complete quantifier. The next token will determine what
-kind of repetition is desired:
+side of the complete quantifier. This space is considered significant
+under C<:sigspace>, and will be distributed as a call to <.ws> between
+all the elements of the match but not on either end.
+
+The next token will determine what kind of repetition is desired:
 
 If the next thing is an integer, then it is parsed as either as an exact
 count or a range:
@@ -758,20 +761,19 @@
 The closure form is always considered procedural, so the item it is
 modifying is never considered part of the longest token.
 
-If you supply any other atom (which may not be quantified), it is
+If you supply any other atom (which may be quantified), it is
 interpreted as a separator (such as an infix operator), and the
 initial item is quantified by the number of times the separator is
 seen between items:
 
-    <alt> ** '|'            # repetition controlled by presence of separator
-    <addend> ** <addop>     # repetition controlled by presence of separator
-    <item> ** [ \!?'==' ]   # repetition controlled by presence of separator
+    <alt> ** '|'            # repetition controlled by presence of character
+    <addend> ** <addop>     # repetition controlled by presence of subrule
+    <item> ** [ \!?'==' ]   # repetition controlled by presence of operator
+    <file>**\h+             # repetition controlled by presence of whitespace
 
 A successful match of such a quantifier always ends "in the middle",
 that is, after the initial item but before the next separator.
-(The separator never matches independently of the next item; if the
-separator matches but the next item fails, it backtracks all the way
-back through the separator.) Therefore
+Therefore
 
     / <ident> ** ',' /
 
@@ -791,6 +793,36 @@
 
     . ** <?same>   # match sequence of identical characters
 
+The separator never matches independently of the next item; if the
+separator matches but the next item fails, it backtracks all the way
+back through the separator. Likewise, this matching of the separator
+does not count as "progress" under C<:ratchet> semantics unless the
+next item succeeds.
+
+When significant space is used under C<:sigspace> with the separator
+form, it applies on both sides of the separator, so
+
+    mm/<element> ** ','/
+    mm/<element>** ','/
+    mm/<element> **','/
+
+all allow whitespace around the separator like this:
+
+    / <element>[<.ws>','<.ws><element>]* /
+
+while
+
+    mm/<element>**','/
+
+excludes all significant whitespace:
+
+    / <element>[','<element>]* /
+
+Of course, you can always match whitespace explicitly if necessary, so to
+allow whitespace after the comma but not before, you can say:
+
+    / <element>**[','\s*] /
+
 =item *
 
 C<< <...> >> are now extensible metasyntax delimiters or I<assertions>
@@ -2636,6 +2668,11 @@
         $to = $<file>[1];
     }
 
+(Note, for clarity we are ignoring whitespace subtleties here--the
+normal sigspace rules would require space only between alphanumeric
+characters, which is wrong. Assume that our file subrule requires a
+real boundary at that point using C<< <!before \S> >> or some such.)
+
 Likewise, with a quantified subrule:
 
     if mm/ mv <file>**{2} / {