Author: larry
Date: Wed Feb 20 10:49:59 2008
New Revision: 14513
Modified:
doc/trunk/design/syn/S05.pod
Log:
Clarification of ** semantics under :sigspace and :ratchet
Allow quantification on separator atom for common \s+ case
Clarify that the <file> examples are ignoring whitespace issues
Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod (original)
+++ doc/trunk/design/syn/S05.pod Wed Feb 20 10:49:59 2008
@@ -14,9 +14,9 @@
Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
Larry Wall <[EMAIL PROTECTED]>
Date: 24 Jun 2002
- Last Modified: 30 Jan 2008
+ Last Modified: 20 Feb 2008
Number: 5
- Version: 72
+ Version: 73
This document summarizes Apocalypse 5, which is about the new regex
syntax. We now try to call them I<regex> rather than "regular
@@ -336,10 +336,10 @@
New modifiers specify Unicode level:
- m:bytes / .**{2} / # match two bytes
- m:codes / .**{2} / # match two codepoints
- m:graphs / .**{2} / # match two language-independent graphemes
- m:chars / .**{2} / # match two characters at current max level
+ m:bytes / .**2 / # match two bytes
+ m:codes / .**2 / # match two codepoints
+ m:graphs / .**2 / # match two language-independent graphemes
+ m:chars / .**2 / # match two characters at current max level
There are corresponding pragmas to default to these levels. Note that
the C<:chars> modifier is always redundant because dot always matches
@@ -361,7 +361,7 @@
is equivalent to the Perl 6 syntax:
- m/ :i ^^ [ <[a..z]> || \d ]**{1..2} <before \s> /
+ m/ :i ^^ [ <[a..z]> || \d ] ** 1..2 <before \s> /
=item *
@@ -733,8 +733,11 @@
The general repetition specifier is now C<**> for maximal matching,
with a corresponding C<**?> for minimal matching. (All such quantifier
modifiers now go directly after the C<**>.) Space is allowed on either
-side of the complete quantifier. The next token will determine what
-kind of repetition is desired:
+side of the complete quantifier. This space is considered significant
+under C<:sigspace>, and will be distributed as a call to <.ws> between
+all the elements of the match but not on either end.
+
+The next token will determine what kind of repetition is desired:
If the next thing is an integer, then it is parsed as either an exact
count or a range:
@@ -758,20 +761,19 @@
The closure form is always considered procedural, so the item it is
modifying is never considered part of the longest token.
-If you supply any other atom (which may not be quantified), it is
+If you supply any other atom (which may be quantified), it is
interpreted as a separator (such as an infix operator), and the
initial item is quantified by the number of times the separator is
seen between items:
- <alt> ** '|' # repetition controlled by presence of separator
- <addend> ** <addop> # repetition controlled by presence of separator
- <item> ** [ \!?'==' ] # repetition controlled by presence of separator
+ <alt> ** '|' # repetition controlled by presence of character
+ <addend> ** <addop> # repetition controlled by presence of subrule
+ <item> ** [ \!?'==' ] # repetition controlled by presence of operator
+ <file>**\h+ # repetition controlled by presence of whitespace
A successful match of such a quantifier always ends "in the middle",
that is, after the initial item but before the next separator.
-(The separator never matches independently of the next item; if the
-separator matches but the next item fails, it backtracks all the way
-back through the separator.) Therefore
+Therefore
/ <ident> ** ',' /
@@ -791,6 +793,36 @@
. ** <?same> # match sequence of identical characters
+The separator never matches independently of the next item; if the
+separator matches but the next item fails, it backtracks all the way
+back through the separator. Likewise, this matching of the separator
+does not count as "progress" under C<:ratchet> semantics unless the
+next item succeeds.
+
+When significant space is used under C<:sigspace> with the separator
+form, it applies on both sides of the separator, so
+
+ mm/<element> ** ','/
+ mm/<element>** ','/
+ mm/<element> **','/
+
+all allow whitespace around the separator like this:
+
+ / <element>[<.ws>','<.ws><element>]* /
+
+while
+
+ mm/<element>**','/
+
+excludes all significant whitespace:
+
+ / <element>[','<element>]* /
+
+Of course, you can always match whitespace explicitly if necessary, so to
+allow whitespace after the comma but not before, you can say:
+
+ / <element>**[','\s*] /
+
=item *
C<< <...> >> are now extensible metasyntax delimiters or I<assertions>
@@ -2636,6 +2668,11 @@
$to = $<file>[1];
}
+(Note, for clarity we are ignoring whitespace subtleties here--the
+normal sigspace rules would require space only between alphanumeric
+characters, which is wrong. Assume that our file subrule requires a
+real boundary at that point using C<< <!before \S> >> or some such.)
+
Likewise, with a quantified subrule:
if mm/ mv <file>**{2} / {