Author: larry
Date: Thu Sep 13 08:59:19 2007
New Revision: 14457

Modified:
   doc/trunk/design/syn/S05.pod
Log: Suggestioned clarifications from lots of folks++ Modified: doc/trunk/design/syn/S05.pod ============================================================================== --- doc/trunk/design/syn/S05.pod (original) +++ doc/trunk/design/syn/S05.pod Thu Sep 13 08:59:19 2007 @@ -14,9 +14,9 @@ Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and Larry Wall <[EMAIL PROTECTED]> Date: 24 Jun 2002 - Last Modified: 11 Sep 2007 + Last Modified: 13 Sep 2007 Number: 5 - Version: 65 + Version: 66 This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them I<regex> rather than "regular @@ -44,9 +44,6 @@ By the way, unlike in Perl 5, the numbered capture variables now start at C<$0> instead of C<$1>. See below. -During the execution of a match, the current match state is stored in a -C<$_> variable lexically scoped to an appropriate portion of the match. -This is transparent to the user for simple matches. =head1 Unchanged syntactic features @@ -333,11 +330,11 @@ If followed by an C<x>, it means repetition. Use C<:x(4)> for the general form. So - s:4x [ (<.ident>) = (\N+) $$] [$0 => $1]; + s:4x [ (<.ident>) = (\N+) $$] = "$0 => $1"; is the same as: - s:x(4) [ (<.ident>) = (\N+) $$] [$0 => $1]; + s:x(4) [ (<.ident>) = (\N+) $$] = "$0 => $1"; which is almost the same as: @@ -407,7 +404,7 @@ The new C<:rw> modifier causes this regex to I<claim> the current string for modification rather than assuming copy-on-write semantics. -All the bindings in C<$/> become lvalues into the string, such +All the captures in C<$/> become lvalues into the string, such that if you modify, say, C<$1>, the original string is modified in that location, and the positions of all the other fields modified accordingly (whatever that means). 
In the absence of this modifier @@ -662,20 +659,32 @@ \s+ { print "but does contain whitespace\n" } / -An B<explicit> reduce from a regex closure binds the I<result object> +An B<explicit> reduction using the C<make> function sets the I<result object> for this match: - / (\d) { reduce $0.sqrt } Remainder /; + / (\d) { make $0.sqrt } Remainder /; This has the effect of capturing the square root of the numified string, instead of the string. The C<Remainder> part is matched but is not returned -unless the first reduce is later overridden by another reduce. +unless the first C<make> is later overridden by another C<make>. -These closures are invoked with a topic (C<$_>) of the current match state. -Within a closure, the instantaneous position within the search is -denoted by the C<.pos> method on that object. As with all string positions, -you must not treat it as a number unless you are very careful about -which units you are dealing with. +These closures are invoked with a topic (C<$_>) of the current match +state (a C<Cursor> object). Within a closure, the instantaneous +position within the search is denoted by the C<.pos> method on +that object. As with all string positions, you must not treat it +as a number unless you are very careful about which units you are +dealing with. + +The C<Cursor> object can also return the original item that we are +matching against; this is available from the C<._> method, named to +remind you that it probably came from the user's C<$_> variable. +(But that may well be off in some other scope when indirect rules +are called, so we mustn't rely on the user's lexical scope.) + +The closure is also guaranteed to start with a C<$/> C<Match> object +representing the match so far. However, if the closure does its own +internal matching, its C<$/> variable will be rebound to the result +of I<that> match until the end of the embedded closure. 
=item * @@ -747,6 +756,11 @@ foo, foo,bar, +It is legal for the separator to be zero-width as long as the pattern on +the left progresses on each iteration: + + . ** <?same> # match sequence of identical characters + =item * C<< <...> >> are now extensible metasyntax delimiters or I<assertions> @@ -784,7 +798,7 @@ C<< <$var> >>. (See assertions below.) This form does not capture, and it fails if C<$var> is tainted. -However, a variable used as the left side of a binding or submatch +However, a variable used as the left side of an alias or submatch operator is not used for matching. $x = <ident> @@ -795,13 +809,41 @@ "$0" ~~ <ident> -It is non-sensical to bind to something that is not a variable: +On the other hand, it is non-sensical to alias to something that is +not a variable: "$0" = <ident> # ERROR + $0 = <ident> # okay + $x = <ident> # okay, temporary capture + $<x> = <ident> # okay, persistent capture + <x=ident> # same thing + +Variables declared in capture aliases are lexically scoped to the +rest of the regex. You should not confuse this use of C<=> with +either ordinary assignment or ordinary binding. You should read +the C<=> more like the pseudoassignment of a declarator than like +normal assignment. It's more like the ordinary C<:=> operator, +since at the level regexes work, strings are immutable, so captures +are really just precomputed substr values. Nevertheless, when you +eventually use the values independently, the substr may be copied, +and then it's more like it was an assignment originally. + +Capture variables of the form C<< $<ident> >> may persist beyond +the lexical scope; if the match succeeds they are remembered in the +C<Match> object's hash, with a key corresponding to the variable name's +identifier. Likewise bound numeric variables persist as C<$0>, etc. + +The capture performed by C<=> creates a new lexical variable if it does +not already exist in the current lexical scope. 
To capture to an outer +lexical variable you must supply an C<OUTER::> as part of the name, +or perform the assignment from within a closure. + + $x = [...] # capture to our own lexical $x + $OUTER::x = [...] # capture to existing lexical $x + [...] -> $tmp { let $x = $tmp } # capture to existing lexical $x -Variables used in bindings are lexically scoped to the rest of the regex. -If the match succeeds they are remembered in the C<Match> object's hash, -with a key corresponding to the variable name without the sigil. +Note however that C<let> (and C<temp>) are not guaranteed to be thread +safe on shared variables, so don't do that. =item * @@ -1010,7 +1052,7 @@ is just shorthand for - $foo=<bar> + $<foo> = <bar> If the first character after the identifier is whitespace, the subsequent text (following any whitespace) is passed as a regex, so: @@ -1043,17 +1085,18 @@ The special named assertions include: - / <before pattern> / # was /(?=pattern)/ - / <after pattern> / # was /(?<=pattern)/ + / <?before pattern> / # lookahead + / <?after pattern> / # lookbehind + + / <?same> / # true between two identical characters - / <sp> / # match the SPACE character (U+0020) - / <ws> / # match "whitespace": - # \s+ if it's between two \w characters, - # \s* otherwise - - / <at($pos)> / # match only at a particular StrPos - # short for <?{ .pos === $pos }> - # (considered declarative until $pos changes) + / <.ws> / # match "whitespace": + # \s+ if it's between two \w characters, + # \s* otherwise + + / <?at($pos)> / # match only at a particular StrPos + # short for <?{ .pos === $pos }> + # (considered declarative until $pos changes) The C<after> assertion implements lookbehind by reversing the syntax tree and looking for things in the opposite order going to the left. @@ -1082,7 +1125,7 @@ string is never matched literally. Such an assertion is not captured. (No assertion with leading punctuation -is captured by default.) You may always bind it explicitly, of course. 
+is captured by default.) You may always capture it explicitly, of course. A subrule is considered declarative to the extent that the front of it is declarative, and to the extent that the variable doesn't change. @@ -1509,8 +1552,8 @@ The two cases can always be distinguished using C<m{...}> or C<rx{...}>: - $var = m{pattern}; # Match regex immediately, assign result - $var = rx{pattern}; # Assign regex expression itself + $match = m{pattern}; # Match regex immediately, assign result + $regex = rx{pattern}; # Assign regex expression itself =item * @@ -2003,11 +2046,11 @@ When used as a scalar, a C<Match> object evaluates to its underlying result object. Usually this is just the entire match string, but -you can override that by calling C<reduce> inside a regex: +you can override that by calling C<make> inside a regex: my $moose = $(m:{ <antler> <body> - { reduce Moose.new( body => $body().attach($antler) ) } + { make Moose.new( body => $body().attach($antler) ) } # match succeeds -- ignore the rest of the regex }); @@ -2030,8 +2073,8 @@ This means that these two work the same: - / <moose> { reduce $moose as Moose } / - / <moose> { reduce $$moose as Moose } / + / <moose> { make $moose as Moose } / + / <moose> { make $$moose as Moose } / =item * @@ -2129,11 +2172,11 @@ Fortunately, when you just want to return a different result object instead of the default C<Match> object, you may associate your return value with -the current match state using the C<reduce> function, which works something +the current match state using the C<make> function, which works something like a C<return>, but doesn't clobber the match state: $str ~~ / foo # Match 'foo' - { reduce 'bar' } # But pretend we matched 'bar' + { make 'bar' } # But pretend we matched 'bar' /; say $(); # says 'bar' @@ -2454,7 +2497,7 @@ # subrule subrule subrule # __^__ _______^_____ __^__ # | | | | | | - m/ <ident> $spaces = (\s*) <digit>+ / + m/ <ident> $<spaces>=(\s*) <digit>+ / =item * @@ -2496,7 +2539,7 @@ 
Note that it makes no difference whether a subrule is angle-bracketed (C<< <ident> >>) or aliased internally (C<< <ident=name> >>) or aliased -externally (C<< $ident = (<alpha>\w*) >>). The name's the thing. +externally (C<< $<ident>=(<alpha>\w*) >>). The name's the thing. =back @@ -2541,7 +2584,7 @@ =item * However, if a subrule is explicitly renamed (or aliased -- see L</Aliasing>), -then only the I<final> name counts when deciding whether it is or isn't +then only the I<new> name counts when deciding whether it is or isn't repeated. For example: if mm/ mv <file> <dir=file> / { @@ -2601,7 +2644,7 @@ # ______/capturing parens\______ # | | # | | - mm/ $key = ( (<[A..E]>) (\d**{3..6}) (X?) ) /; + mm/ $<key>=( (<[A..E]>) (\d**{3..6}) (X?) ) /; then the outer capturing parens no longer capture into the array of C<$/> as unaliased parens would. Instead the aliased parens capture @@ -2659,7 +2702,7 @@ # ___/non-capturing brackets\___ # | | # | | - mm/ $key = [ (<[A..E]>) (\d**{3..6}) (X?) ] /; + mm/ $<key>=[ (<[A..E]>) (\d**{3..6}) (X?) ] /; then the corresponding C<< $/<key> >> Match object contains only the string matched by the non-capturing brackets. @@ -2833,7 +2876,7 @@ } - if m/ mv \s+ $from=(\S+ \s+)* / { + if m/ mv \s+ $<from>=(\S+ \s+)* / { # Quantified subpattern returns a list of Match objects, # so $/<from> contains an array of Match # objects, one for each successful match of the subpattern @@ -2849,7 +2892,7 @@ brackets (as described in L<Named scalar aliases applied to non-capturing brackets>). For example: - "coffee fifo fumble" ~~ m/ $effs = [f <-[f]>**{1..2} \s*]+ /; + "coffee fifo fumble" ~~ m/ $<effs>=[f <-[f]>**{1..2} \s*]+ /; say $<effs>; # prints "fee fifo fum" @@ -2865,7 +2908,7 @@ An alias can also be specified using an array as the alias instead of a scalar. 
For example: - m/ mv \s+ @from = [(\S+) \s+]* <dir> /; + m/ mv \s+ @<from>=[(\S+) \s+]* <dir> /; =item * @@ -2877,8 +2920,8 @@ structurally different alternations (by enforcing array captures in all branches): - mm/ Mr?s? @names=<ident> W\. @names=<ident> - | Mr?s? @names=<ident> + mm/ Mr?s? @<names>=<ident> W\. @<names>=<ident> + | Mr?s? @<names>=<ident> /; # Aliasing to @names means $/<names> is always @@ -2891,8 +2934,8 @@ For convenience and consistency, C<< @<key> >> can also be used outside a regex, as a shorthand for C<< @( $/<key> ) >>. That is: - mm/ Mr?s? @names=<ident> W\. @names=<ident> - | Mr?s? @names=<ident> + mm/ Mr?s? @<names>=<ident> W\. @<names>=<ident> + | Mr?s? @<names>=<ident> /; say @<names>; @@ -2903,18 +2946,18 @@ brackets, it captures the substrings matched by each repetition of the brackets into separate elements of the corresponding array. That is: - mm/ mv $files=[ f.. \s* ]* /; # $/<files> assigned a single - # Match object containing the - # complete substring matched by - # the full set of repetitions - # of the non-capturing brackets - - mm/ mv @files=[ f.. \s* ]* /; # $/<files> assigned an array, - # each element of which is a - # Match object containing - # the substring matched by Nth - # repetition of the non- - # capturing bracket match + mm/ mv $<files>=[ f.. \s* ]* /; # $/<files> assigned a single + # Match object containing the + # complete substring matched by + # the full set of repetitions + # of the non-capturing brackets + + mm/ mv @<files>=[ f.. \s* ]* /; # $/<files> assigned an array, + # each element of which is a + # Match object containing + # the substring matched by Nth + # repetition of the non- + # capturing bracket match =item * @@ -2925,7 +2968,7 @@ an array alias on a subpattern flattens and collects all nested subpattern captures within the aliased subpattern. 
For example: - if mm/ $pairs=( (\w+) \: (\N+) )+ / { + if mm/ $<pairs>=( (\w+) \: (\N+) )+ / { # Scalar alias, so $/<pairs> is assigned an array # of Match objects, each of which has its own array # of two subcaptures... @@ -2937,7 +2980,7 @@ } - if mm/ @pairs=( (\w+) \: (\N+) )+ / { + if mm/ @<pairs>=( (\w+) \: (\N+) )+ / { # Array alias, so $/<pairs> is assigned an array # of Match objects, each of which is flattened out of # the two subcaptures within the subpattern @@ -2957,7 +3000,7 @@ rule pair { (\w+) \: (\N+) \n } - if mm/ $pairs=<pair>+ / { + if mm/ $<pairs>=<pair>+ / { # Scalar alias, so $/<pairs> contains an array of # Match objects, each of which is the result of the # <pair> subrule call... @@ -2969,7 +3012,7 @@ } - if mm/ mv @pairs=<pair>+ / { + if mm/ mv @<pairs>=<pair>+ / { # Array alias, so $/<pairs> contains an array of # Match objects, all flattened down from the # nested arrays inside the Match objects returned @@ -3032,7 +3075,7 @@ An alias can also be specified using a hash as the alias variable, instead of a scalar or an array. For example: - m/ mv %location=( (<ident>) \: (\N+) )+ /; + m/ mv %<location>=( (<ident>) \: (\N+) )+ /; =item * @@ -3086,11 +3129,11 @@ Instead of using internal aliases like: - m/ mv @files=<ident>+ $dir=<ident> / + m/ mv @<files>=<ident>+ $<dir>=<ident> / the name of an ordinary variable can be used as an I<external> alias, like so: - m/ mv @files=<ident>+ $dir=<ident> / + m/ mv @OUTER::files=<ident>+ $OUTER::dir=<ident> / =item * @@ -3243,20 +3286,20 @@ grammar Letter { rule text { <greet> <body> <close> } - rule greet { [Hi|Hey|Yo] $to=(\S+?) , $$} + rule greet { [Hi|Hey|Yo] $<to>=(\S+?) , $$} rule body { <line>+? } # note: backtracks forwards via +? - rule close { Later dude, $from=(.+) } + rule close { Later dude, $<from>=(.+) } # etc. } grammar FormalLetter is Letter { - rule greet { Dear $to=(\S+?) , $$} + rule greet { Dear $<to>=(\S+?) 
, $$} - rule close { Yours sincerely, $from=(.+) } + rule close { Yours sincerely, $<from>=(.+) } } @@ -3552,4 +3595,30 @@ =back +=head1 When C<$/> is valid + +To provide implementational freedom, the C<$/> variable is not +guaranteed to be defined until the pattern reaches a sequence +point that requires it (such as completing the match, or calling an +embedded closure, or even evaluating a submatch that requires a Perl +expression for its argument). Within regex code, C<$/> is officially +undefined, and references to C<$0> or other capture variables may +be compiled to produce the current value without reference to C<$/>. +Likewise a reference to C<< $<foo> >> does not necessarily mean C<< +$/<foo> >> within the regex proper. During the execution of a match, +the current match state is likely to be stored in a C<$_> variable +lexically scoped to an appropriate portion of the match, but that is +not guaranteed to behave the same as the C<$/> object, because C<$/> +is of type C<Match>, while the match state is of type C<Cursor>. +(It really depends on the implementation of the pattern matching +engine.) + +In any case this is all transparent to the user for simple matches; +and outside of regex code (and inside closures within the regex) +the C<$/> variable is guaranteed to represent the state of the match +at that point. That is, normal Perl code can always depend on C<< +$<foo> >> meaning C<< $/<foo> >>, and C<$0> meaning C<$/[0]>, whether +that code is embedded in a closure within the regex or outside the +regex after the match completes. + =for vim:set expandtab sw=4: