Author: larry
Date: Thu Sep 13 08:59:19 2007
New Revision: 14457
Modified:
doc/trunk/design/syn/S05.pod
Log:
Suggestioned clarifications from lots of folks++
Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod (original)
+++ doc/trunk/design/syn/S05.pod Thu Sep 13 08:59:19 2007
@@ -14,9 +14,9 @@
Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
Larry Wall <[EMAIL PROTECTED]>
Date: 24 Jun 2002
- Last Modified: 11 Sep 2007
+ Last Modified: 13 Sep 2007
Number: 5
- Version: 65
+ Version: 66
This document summarizes Apocalypse 5, which is about the new regex
syntax. We now try to call them I<regex> rather than "regular
@@ -44,9 +44,6 @@
By the way, unlike in Perl 5, the numbered capture variables now
start at C<$0> instead of C<$1>. See below.
-During the execution of a match, the current match state is stored in a
-C<$_> variable lexically scoped to an appropriate portion of the match.
-This is transparent to the user for simple matches.
=head1 Unchanged syntactic features
@@ -333,11 +330,11 @@
If followed by an C<x>, it means repetition. Use C<:x(4)> for the
general form. So
- s:4x [ (<.ident>) = (\N+) $$] [$0 => $1];
+ s:4x [ (<.ident>) = (\N+) $$] = "$0 => $1";
is the same as:
- s:x(4) [ (<.ident>) = (\N+) $$] [$0 => $1];
+ s:x(4) [ (<.ident>) = (\N+) $$] = "$0 => $1";
which is almost the same as:
@@ -407,7 +404,7 @@
The new C<:rw> modifier causes this regex to I<claim> the current
string for modification rather than assuming copy-on-write semantics.
-All the bindings in C<$/> become lvalues into the string, such
+All the captures in C<$/> become lvalues into the string, such
that if you modify, say, C<$1>, the original string is modified in
that location, and the positions of all the other fields modified
accordingly (whatever that means). In the absence of this modifier
@@ -662,20 +659,32 @@
\s+ { print "but does contain whitespace\n" }
/
-An B<explicit> reduce from a regex closure binds the I<result object>
+An B<explicit> reduction using the C<make> function sets the I<result object>
for this match:
- / (\d) { reduce $0.sqrt } Remainder /;
+ / (\d) { make $0.sqrt } Remainder /;
This has the effect of capturing the square root of the numified string,
instead of the string. The C<Remainder> part is matched but is not returned
-unless the first reduce is later overridden by another reduce.
+unless the first C<make> is later overridden by another C<make>.
-These closures are invoked with a topic (C<$_>) of the current match state.
-Within a closure, the instantaneous position within the search is
-denoted by the C<.pos> method on that object. As with all string positions,
-you must not treat it as a number unless you are very careful about
-which units you are dealing with.
+These closures are invoked with a topic (C<$_>) of the current match
+state (a C<Cursor> object). Within a closure, the instantaneous
+position within the search is denoted by the C<.pos> method on
+that object. As with all string positions, you must not treat it
+as a number unless you are very careful about which units you are
+dealing with.
+
+The C<Cursor> object can also return the original item that we are
+matching against; this is available from the C<._> method, named to
+remind you that it probably came from the user's C<$_> variable.
+(But that may well be off in some other scope when indirect rules
+are called, so we mustn't rely on the user's lexical scope.)
+
+The closure is also guaranteed to start with a C<$/> C<Match> object
+representing the match so far. However, if the closure does its own
+internal matching, its C<$/> variable will be rebound to the result
+of I<that> match until the end of the embedded closure.
=item *
@@ -747,6 +756,11 @@
foo,
foo,bar,
+It is legal for the separator to be zero-width as long as the pattern on
+the left progresses on each iteration:
+
+ . ** <?same> # match sequence of identical characters
+
=item *
C<< <...> >> are now extensible metasyntax delimiters or I<assertions>
@@ -784,7 +798,7 @@
C<< <$var> >>. (See assertions below.) This form does not capture,
and it fails if C<$var> is tainted.
-However, a variable used as the left side of a binding or submatch
+However, a variable used as the left side of an alias or submatch
operator is not used for matching.
$x = <ident>
@@ -795,13 +809,41 @@
"$0" ~~ <ident>
-It is non-sensical to bind to something that is not a variable:
+On the other hand, it is non-sensical to alias to something that is
+not a variable:
"$0" = <ident> # ERROR
+ $0 = <ident> # okay
+ $x = <ident> # okay, temporary capture
+ $<x> = <ident> # okay, persistent capture
+ <x=ident> # same thing
+
+Variables declared in capture aliases are lexically scoped to the
+rest of the regex. You should not confuse this use of C<=> with
+either ordinary assignment or ordinary binding. You should read
+the C<=> more like the pseudoassignment of a declarator than like
+normal assignment. It's more like the ordinary C<:=> operator,
+since at the level regexes work, strings are immutable, so captures
+are really just precomputed substr values. Nevertheless, when you
+eventually use the values independently, the substr may be copied,
+and then it's more like it was an assignment originally.
+
+Capture variables of the form C<< $<ident> >> may persist beyond
+the lexical scope; if the match succeeds they are remembered in the
+C<Match> object's hash, with a key corresponding to the variable name's
+identifier. Likewise bound numeric variables persist as C<$0>, etc.
+
+The capture performed by C<=> creates a new lexical variable if it does
+not already exist in the current lexical scope. To capture to an outer
+lexical variable you must supply an C<OUTER::> as part of the name,
+or perform the assignment from within a closure.
+
+ $x = [...] # capture to our own lexical $x
+ $OUTER::x = [...] # capture to existing lexical $x
+ [...] -> $tmp { let $x = $tmp } # capture to existing lexical $x
-Variables used in bindings are lexically scoped to the rest of the regex.
-If the match succeeds they are remembered in the C<Match> object's hash,
-with a key corresponding to the variable name without the sigil.
+Note however that C<let> (and C<temp>) are not guaranteed to be thread
+safe on shared variables, so don't do that.
=item *
@@ -1010,7 +1052,7 @@
is just shorthand for
- $foo=<bar>
+ $<foo> = <bar>
If the first character after the identifier is whitespace, the
subsequent text (following any whitespace) is passed as a regex, so:
@@ -1043,17 +1085,18 @@
The special named assertions include:
- / <before pattern> / # was /(?=pattern)/
- / <after pattern> / # was /(?<=pattern)/
+ / <?before pattern> / # lookahead
+ / <?after pattern> / # lookbehind
+
+ / <?same> / # true between two identical characters
- / <sp> / # match the SPACE character (U+0020)
- / <ws> / # match "whitespace":
- # \s+ if it's between two \w characters,
- # \s* otherwise
-
- / <at($pos)> / # match only at a particular StrPos
- # short for <?{ .pos === $pos }>
- # (considered declarative until $pos changes)
+ / <.ws> / # match "whitespace":
+ # \s+ if it's between two \w characters,
+ # \s* otherwise
+
+ / <?at($pos)> / # match only at a particular StrPos
+ # short for <?{ .pos === $pos }>
+ # (considered declarative until $pos changes)
The C<after> assertion implements lookbehind by reversing the syntax
tree and looking for things in the opposite order going to the left.
@@ -1082,7 +1125,7 @@
string is never matched literally.
Such an assertion is not captured. (No assertion with leading punctuation
-is captured by default.) You may always bind it explicitly, of course.
+is captured by default.) You may always capture it explicitly, of course.
A subrule is considered declarative to the extent that the front of it
is declarative, and to the extent that the variable doesn't change.
@@ -1509,8 +1552,8 @@
The two cases can always be distinguished using C<m{...}> or C<rx{...}>:
- $var = m{pattern}; # Match regex immediately, assign result
- $var = rx{pattern}; # Assign regex expression itself
+ $match = m{pattern}; # Match regex immediately, assign result
+ $regex = rx{pattern}; # Assign regex expression itself
=item *
@@ -2003,11 +2046,11 @@
When used as a scalar, a C<Match> object evaluates to its underlying
result object. Usually this is just the entire match string, but
-you can override that by calling C<reduce> inside a regex:
+you can override that by calling C<make> inside a regex:
my $moose = $(m:{
<antler> <body>
- { reduce Moose.new( body => $body().attach($antler) ) }
+ { make Moose.new( body => $body().attach($antler) ) }
# match succeeds -- ignore the rest of the regex
});
@@ -2030,8 +2073,8 @@
This means that these two work the same:
- / <moose> { reduce $moose as Moose } /
- / <moose> { reduce $$moose as Moose } /
+ / <moose> { make $moose as Moose } /
+ / <moose> { make $$moose as Moose } /
=item *
@@ -2129,11 +2172,11 @@
Fortunately, when you just want to return a different result object instead
of the default C<Match> object, you may associate your return value with
-the current match state using the C<reduce> function, which works something
+the current match state using the C<make> function, which works something
like a C<return>, but doesn't clobber the match state:
$str ~~ / foo # Match 'foo'
- { reduce 'bar' } # But pretend we matched 'bar'
+ { make 'bar' } # But pretend we matched 'bar'
/;
say $(); # says 'bar'
@@ -2454,7 +2497,7 @@
# subrule subrule subrule
# __^__ _______^_____ __^__
# | | | | | |
- m/ <ident> $spaces = (\s*) <digit>+ /
+ m/ <ident> $<spaces>=(\s*) <digit>+ /
=item *
@@ -2496,7 +2539,7 @@
Note that it makes no difference whether a subrule is angle-bracketed
(C<< <ident> >>) or aliased internally (C<< <ident=name> >>) or aliased
-externally (C<< $ident = (<alpha>\w*) >>). The name's the thing.
+externally (C<< $<ident>=(<alpha>\w*) >>). The name's the thing.
=back
@@ -2541,7 +2584,7 @@
=item *
However, if a subrule is explicitly renamed (or aliased -- see L</Aliasing>),
-then only the I<final> name counts when deciding whether it is or isn't
+then only the I<new> name counts when deciding whether it is or isn't
repeated. For example:
if mm/ mv <file> <dir=file> / {
@@ -2601,7 +2644,7 @@
# ______/capturing parens\______
# | |
# | |
- mm/ $key = ( (<[A..E]>) (\d**{3..6}) (X?) ) /;
+ mm/ $<key>=( (<[A..E]>) (\d**{3..6}) (X?) ) /;
then the outer capturing parens no longer capture into the array of
C<$/> as unaliased parens would. Instead the aliased parens capture
@@ -2659,7 +2702,7 @@
# ___/non-capturing brackets\___
# | |
# | |
- mm/ $key = [ (<[A..E]>) (\d**{3..6}) (X?) ] /;
+ mm/ $<key>=[ (<[A..E]>) (\d**{3..6}) (X?) ] /;
then the corresponding C<< $/<key> >> Match object contains only the string
matched by the non-capturing brackets.
@@ -2833,7 +2876,7 @@
}
- if m/ mv \s+ $from=(\S+ \s+)* / {
+ if m/ mv \s+ $<from>=(\S+ \s+)* / {
# Quantified subpattern returns a list of Match objects,
# so $/<from> contains an array of Match
# objects, one for each successful match of the subpattern
@@ -2849,7 +2892,7 @@
brackets (as described in L<Named scalar aliases applied to
non-capturing brackets>). For example:
- "coffee fifo fumble" ~~ m/ $effs = [f <-[f]>**{1..2} \s*]+ /;
+ "coffee fifo fumble" ~~ m/ $<effs>=[f <-[f]>**{1..2} \s*]+ /;
say $<effs>; # prints "fee fifo fum"
@@ -2865,7 +2908,7 @@
An alias can also be specified using an array as the alias instead of a scalar.
For example:
- m/ mv \s+ @from = [(\S+) \s+]* <dir> /;
+ m/ mv \s+ @<from>=[(\S+) \s+]* <dir> /;
=item *
@@ -2877,8 +2920,8 @@
structurally different alternations (by enforcing array captures in all
branches):
- mm/ Mr?s? @names=<ident> W\. @names=<ident>
- | Mr?s? @names=<ident>
+ mm/ Mr?s? @<names>=<ident> W\. @<names>=<ident>
+ | Mr?s? @<names>=<ident>
/;
# Aliasing to @names means $/<names> is always
@@ -2891,8 +2934,8 @@
For convenience and consistency, C<< @<key> >> can also be used outside a
regex, as a shorthand for C<< @( $/<key> ) >>. That is:
- mm/ Mr?s? @names=<ident> W\. @names=<ident>
- | Mr?s? @names=<ident>
+ mm/ Mr?s? @<names>=<ident> W\. @<names>=<ident>
+ | Mr?s? @<names>=<ident>
/;
say @<names>;
@@ -2903,18 +2946,18 @@
brackets, it captures the substrings matched by each repetition of the
brackets into separate elements of the corresponding array. That is:
- mm/ mv $files=[ f.. \s* ]* /; # $/<files> assigned a single
- # Match object containing the
- # complete substring matched by
- # the full set of repetitions
- # of the non-capturing brackets
-
- mm/ mv @files=[ f.. \s* ]* /; # $/<files> assigned an array,
- # each element of which is a
- # Match object containing
- # the substring matched by Nth
- # repetition of the non-
- # capturing bracket match
+ mm/ mv $<files>=[ f.. \s* ]* /; # $/<files> assigned a single
+ # Match object containing the
+ # complete substring matched by
+ # the full set of repetitions
+ # of the non-capturing brackets
+
+ mm/ mv @<files>=[ f.. \s* ]* /; # $/<files> assigned an array,
+ # each element of which is a
+ # Match object containing
+ # the substring matched by Nth
+ # repetition of the non-
+ # capturing bracket match
=item *
@@ -2925,7 +2968,7 @@
an array alias on a subpattern flattens and collects all nested
subpattern captures within the aliased subpattern. For example:
- if mm/ $pairs=( (\w+) \: (\N+) )+ / {
+ if mm/ $<pairs>=( (\w+) \: (\N+) )+ / {
# Scalar alias, so $/<pairs> is assigned an array
# of Match objects, each of which has its own array
# of two subcaptures...
@@ -2937,7 +2980,7 @@
}
- if mm/ @pairs=( (\w+) \: (\N+) )+ / {
+ if mm/ @<pairs>=( (\w+) \: (\N+) )+ / {
# Array alias, so $/<pairs> is assigned an array
# of Match objects, each of which is flattened out of
# the two subcaptures within the subpattern
@@ -2957,7 +3000,7 @@
rule pair { (\w+) \: (\N+) \n }
- if mm/ $pairs=<pair>+ / {
+ if mm/ $<pairs>=<pair>+ / {
# Scalar alias, so $/<pairs> contains an array of
# Match objects, each of which is the result of the
# <pair> subrule call...
@@ -2969,7 +3012,7 @@
}
- if mm/ mv @pairs=<pair>+ / {
+ if mm/ mv @<pairs>=<pair>+ / {
# Array alias, so $/<pairs> contains an array of
# Match objects, all flattened down from the
# nested arrays inside the Match objects returned
@@ -3032,7 +3075,7 @@
An alias can also be specified using a hash as the alias variable,
instead of a scalar or an array. For example:
- m/ mv %location=( (<ident>) \: (\N+) )+ /;
+ m/ mv %<location>=( (<ident>) \: (\N+) )+ /;
=item *
@@ -3086,11 +3129,11 @@
Instead of using internal aliases like:
- m/ mv @files=<ident>+ $dir=<ident> /
+ m/ mv @<files>=<ident>+ $<dir>=<ident> /
the name of an ordinary variable can be used as an I<external> alias, like so:
- m/ mv @files=<ident>+ $dir=<ident> /
+ m/ mv @OUTER::files=<ident>+ $OUTER::dir=<ident> /
=item *
@@ -3243,20 +3286,20 @@
grammar Letter {
rule text { <greet> <body> <close> }
- rule greet { [Hi|Hey|Yo] $to=(\S+?) , $$}
+ rule greet { [Hi|Hey|Yo] $<to>=(\S+?) , $$}
rule body { <line>+? } # note: backtracks forwards via +?
- rule close { Later dude, $from=(.+) }
+ rule close { Later dude, $<from>=(.+) }
# etc.
}
grammar FormalLetter is Letter {
- rule greet { Dear $to=(\S+?) , $$}
+ rule greet { Dear $<to>=(\S+?) , $$}
- rule close { Yours sincerely, $from=(.+) }
+ rule close { Yours sincerely, $<from>=(.+) }
}
@@ -3552,4 +3595,30 @@
=back
+=head1 When C<$/> is valid
+
+To provide implementational freedom, the C<$/> variable is not
+guaranteed to be defined until the pattern reaches a sequence
+point that requires it (such as completing the match, or calling an
+embedded closure, or even evaluating a submatch that requires a Perl
+expression for its argument). Within regex code, C<$/> is officially
+undefined, and references to C<$0> or other capture variables may
+be compiled to produce the current value without reference to C<$/>.
+Likewise a reference to C<< $<foo> >> does not necessarily mean C<<
+$/<foo> >> within the regex proper. During the execution of a match,
+the current match state is likely to be stored in a C<$_> variable
+lexically scoped to an appropriate portion of the match, but that is
+not guaranteed to behave the same as the C<$/> object, because C<$/>
+is of type C<Match>, while the match state is of type C<Cursor>.
+(It really depends on the implementation of the pattern matching
+engine.)
+
+In any case this is all transparent to the user for simple matches;
+and outside of regex code (and inside closures within the regex)
+the C<$/> variable is guaranteed to represent the state of the match
+at that point. That is, normal Perl code can always depend on C<<
+$<foo> >> meaning C<< $/<foo> >>, and C<$0> meaning C<$/[0]>, whether
+that code is embedded in a closure within the regex or outside the
+regex after the match completes.
+
=for vim:set expandtab sw=4: