Author: larry
Date: Tue Sep 11 11:54:28 2007
New Revision: 14454
Modified: doc/trunk/design/syn/S05.pod
Log: Last (we hope) major revision of regex syntax. Modified: doc/trunk/design/syn/S05.pod ============================================================================== --- doc/trunk/design/syn/S05.pod (original) +++ doc/trunk/design/syn/S05.pod Tue Sep 11 11:54:28 2007 @@ -14,9 +14,9 @@ Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and Larry Wall <[EMAIL PROTECTED]> Date: 24 Jun 2002 - Last Modified: 6 Sep 2007 + Last Modified: 11 Sep 2007 Number: 5 - Version: 64 + Version: 65 This document summarizes Apocalypse 5, which is about the new regex syntax. We now try to call them I<regex> rather than "regular @@ -36,14 +36,18 @@ =head1 New match result and capture variables The underlying match result object is now available as the C<$/> -variable, which is implicitly lexically scoped. All access to the -current (or most recent) match is through this variable, even when +variable, which is implicitly lexically scoped. All user access to the +most recent match is through this variable, even when it doesn't look like it. The individual capture variables (such as C<$0>, C<$1>, etc.) are just elements of C<$/>. By the way, unlike in Perl 5, the numbered capture variables now start at C<$0> instead of C<$1>. See below. +During the execution of a match, the current match state is stored in a +C<$_> variable lexically scoped to an appropriate portion of the match. +This is transparent to the user for simple matches. + =head1 Unchanged syntactic features The following regex features use the same syntax as in Perl 5: @@ -75,9 +79,11 @@ While the syntax of C<|> does not change, the default semantics do change slightly. We are attempting to concoct a pleasing mixture of declarative and procedural matching so that we can have the -best of both. See the section below on "Longest-token matching". +best of both. In short, you need not write your own tokener for +a grammar because Perl will write one for you. See the section +below on "Longest-token matching". 
-=head1 Simplified lexical parsing +=head1 Simplified lexical parsing of patterns Unlike traditional regular expressions, Perl 6 does not require you to memorize an arbitrary list of metacharacters. Instead it @@ -202,58 +208,49 @@ =item * The C<:c> (or C<:continue>) modifier causes the pattern to continue -scanning from the string's current C<.pos>: +scanning from the specified position (defaulting to C<$/.to>): - m:c/ pattern / # start at end of - # previous match on $_ + m:c($p)/ pattern / # start scanning at position $p Note that this does not automatically anchor the pattern to the starting location. (Use C<:p> for that.) The pattern you supply to C<split> has an implicit C<:c> modifier. -The C<:continue> modifier takes an optional argument of type C<StrPos> -which specifies the point at which to start scanning for a match. -This should not be used unless you know what you're doing, or just -happen to like hard-to-debug infinite loops. +String positions are of type C<StrPos> and should generally be treated +as opaque. =item * The C<:p> (or C<:pos>) modifier causes the pattern to try to match only at -the string's current C<.pos>: +the specified string position: - m:p/ pattern / # match at end of - # previous match on $_ + m:pos($p)/ pattern / # match at position $p -Since this is implicitly anchored to the position, it's suitable for -building parsers and lexers. The pattern you supply to a Perl macro's -C<is parsed> trait has an implicit C<:p> modifier. +If the argument is omitted, it defaults to C<$/.to>. (Unlike in +Perl 5, the string itself has no clue where its last match ended.) +All subrule matches are implicitly passed their starting position. +Likewise, the pattern you supply to a Perl macro's C<is parsed> +trait has an implicit C<:p> modifier. Note that - m:c/pattern/ + m:c($p)/pattern/ is roughly equivalent to - m:p/.*? <( pattern )> / - -Also note that any regex called as a subrule is implicitly anchored to the -current position anyway. 
- -The C<:pos> modifier takes an optional argument of type C<StrPos> -which specifies the point at which to attempt a match. This should not -be used lightly. Put it in the category of a "goto". + m:p($p)/.*? <( pattern )> / =item * The new C<:s> (C<:sigspace>) modifier causes whitespace sequences to be considered "significant"; they are replaced by a whitespace -matching rule, C<< <+ws> >>. That is, +matching rule, C<< <.ws> >>. That is, m:s/ next cmd = <condition>/ is the same as: - m/ <+ws> next <+ws> cmd <+ws> = <+ws> <condition>/ + m/ <.ws> next <.ws> cmd <.ws> = <.ws> <condition>/ which is effectively the same as: @@ -265,9 +262,9 @@ or equivalently, - m { (a|\*) <+ws> (b|\+) } + m { (a|\*) <.ws> (b|\+) } -C<< <+ws> >> can't decide what to do until it sees the data. +C<< <.ws> >> can't decide what to do until it sees the data. It still does the right thing. If not, define your own C<< ws >> and C<:sigspace> will use that. @@ -275,8 +272,8 @@ the parser rules automatically handle whitespace policy for you. In this context, whitespace often includes comments, depending on how the grammar chooses to define its whitespace rule. Although the -default C<< <+ws> >> subrule recognizes no comment construct, any -grammar is free to override the rule. The C<< <+ws> >> rule is not +default C<< <.ws> >> subrule recognizes no comment construct, any +grammar is free to override the rule. The C<< <.ws> >> rule is not intended to mean the same thing everywhere. It's also possible to pass an argument to C<:sigspace> specifying @@ -285,7 +282,7 @@ important to distinguish the significant whitespace in the pattern from the "whitespace" being matched, so we'll call the pattern's whitespace I<sigspace>, and generally reserve I<whitespace> to indicate whatever -C<< <+ws> >> matches in the current grammar. The correspondence +C<< <.ws> >> matches in the current grammar. 
The correspondence between sigspace and whitespace is primarily metaphorical, which is why the correspondence is both useful and (potentially) confusing. @@ -336,16 +333,15 @@ If followed by an C<x>, it means repetition. Use C<:x(4)> for the general form. So - s:4x [ (<+ident>) = (\N+) $$] [$0 => $1]; + s:4x [ (<.ident>) = (\N+) $$] [$0 => $1]; is the same as: - s:x(4) [ (<+ident>) = (\N+) $$] [$0 => $1]; + s:x(4) [ (<.ident>) = (\N+) $$] [$0 => $1]; which is almost the same as: - $_.pos = 0; - s:c[ (<+ident>) = (\N+) $$] = "$0 => $1" for 1..4; + s:c[ (<.ident>) = (\N+) $$] = "$0 => $1" for 1..4; except that the string is unchanged unless all four matches are found. However, ranges are allowed, so you can say C<:x(1..4)> to change anywhere @@ -418,6 +414,9 @@ (especially if it isn't implemented yet, or is never implemented), all pieces of C<$/> are considered copy-on-write, if not read-only. +[Conjecture: this should really associate a pattern with a string variable, +not a (presumably immutable) string value.] + =item * The new C<:keepall> modifier causes this regex and all invoked subrules @@ -450,7 +449,7 @@ and these are equivalent to $string ~~ m/^ \d+: $/; - $string ~~ m/^ <+ws> \d+: <+ws> $/; + $string ~~ m/^ <.ws> \d+: <.ws> $/; =item * @@ -778,7 +777,7 @@ However, a variable used as the left side of a binding or submatch operator is not used for matching. - $x := <ident> + $x = <ident> $0 ~~ <ident> If you do want to match C<$0> again and then use that as the submatch, @@ -788,7 +787,11 @@ It is non-sensical to bind to something that is not a variable: - "$0" := <ident> # ERROR + "$0" = <ident> # ERROR + +Variables used in bindings are lexically scoped to the rest of the regex. +If the match succeeds they are remembered in the C<Match> object's hash, +with a key corresponding to the variable name without the sigil. 
=item * @@ -990,6 +993,15 @@ <foo('bar')> +If the first character after the identifier is an C<=>, then the identifier +is taken as an alias for what follows. In particular, + + <foo=bar> + +is just shorthand for + + $foo=<bar> + If the first character after the identifier is whitespace, the subsequent text (following any whitespace) is passed as a regex, so: @@ -1009,22 +1021,7 @@ To pass a string with leading whitespace, or to interpolate any values into the string, you must use the parenthesized form. -If the first character is a plus or minus, the rest of the assertion -is parsed as a set of character classes (though the definition of -character class is intentionally vague, and may include any other rule -whether it matches characters exclusively or not). - -An initial identifier is taken as a character class, so the first -character after the identifier doesn't matter in this case, and you -can use whitespace however you like. Therefore - - <foo+bar-baz> - -can be written - - <+ foo + bar - baz> - -Likewise an initial left square bracket indicates character class syntax. (See below.) +No other characters are allowed after the initial identifier. Subrule matches are considered declarative to the extent that the front of the subrule is itself considered declarative. If a @@ -1045,7 +1042,7 @@ # \s* otherwise / <at($pos)> / # match only at a particular StrPos - # short for <?{ .pos == $pos }> + # short for <?{ .pos === $pos }> # (considered declarative until $pos changes) The C<after> assertion implements lookbehind by reversing the syntax @@ -1059,30 +1056,23 @@ =item * -A leading C<+> causes a named assertion not to capture what it matches (see +A leading C<.> causes a named assertion not to capture what it matches (see L<Subrule captures>. 
For example: / <ident> <ws> / # $/<ident> and $/<ws> both captured - / <+ident> <ws> / # only $/<ws> captured - / <+ident> <+ws> / # nothing captured + / <.ident> <ws> / # only $/<ws> captured + / <.ident> <.ws> / # nothing captured The non-capturing behavior may be overridden with a C<:keepall>. -The rest of the assertion is reparsed as if the C<+> (and any following -whitespace) weren't there, so it is legal (but redundant) to say: - - <+++ws> - <+ + +ws> - =item * A leading C<$> indicates an indirect subrule. The variable must contain either a C<Regex> object, or a string to be compiled as the regex. The string is never matched literally. -By default C<< <$foo> >> is captured into C<< $<foo> >>, but you can -use the C<< <+$foo> >> form to suppress capture, and you can always say -C<< $<$foo> := <$foo> >> if you prefer to include the sigil in the key. +Such an assertion is not captured. (No assertion with leading punctuation +is captured by default.) You may always bind it explicitly, of course. A subrule is considered declarative to the extent that the front of it is declarative, and to the extent that the variable doesn't change. @@ -1108,9 +1098,7 @@ That is, a string is forced to be compiled as a subrule instead of being matched literally. (There is no difference for a C<Regex> object.) -By default C<< <@foo> >> is captured into C<< $<foo> >>, but you can -use the C<< <+@foo> >> form to suppress capture, and you can always say -C<< $<@foo> := <@foo> >> if you prefer to include the sigil in the key. +This assertion is not automatically captured. =item * @@ -1119,9 +1107,7 @@ to a regex at match time. (Numeric values may still indicate "false match". and a closure may do whatever it likes.) -By default C<< <%foo> >> is captured into C<< $<foo> >>, but you can -use the C<< <+%foo> >> form to suppress capture, and you can always say -C<< $<%foo> := <%foo> >> if you prefer to include the sigil in the key. +This assertion is not automatically captured.
As with bare hash, the longest key matches according to the venerable I<longest-token rule>. @@ -1131,7 +1117,7 @@ A leading C<{> indicates code that produces a regex to be interpolated into the pattern at that point as a subrule: - / (<+ident>) <{ %cache{$0} //= get_body_for($0) }> / + / (<.ident>) <{ %cache{$0} //= get_body_for($0) }> / The closure is guaranteed to be run at the canonical time; it declares a sequence point, and is considered to be procedural. @@ -1169,7 +1155,7 @@ time you use it unless the string changes. (Any external lexical variable names must be rebound each time though.) Subrules may not be interpolated with unbalanced bracketing. An interpolated subrule -keeps its own inner C<$/>, so its parentheses never count toward the +keeps its own inner match result as a single item, so its parentheses never count toward the outer regexes groupings. (In other words, parenthesis numbering is always lexically scoped.) @@ -1201,7 +1187,7 @@ / <[a..z_]>* / -Whitespace is ignored within square brackets and after the initial C<+>. +Whitespace is ignored within square brackets: / <[ a..z _ ]>* / @@ -1210,6 +1196,7 @@ A leading C<-> indicates a complemented character class: / <-[a..z_]> <-alpha> / + / <- [a..z_]> <- alpha> / # whitespace allowed after - This is essentially the same as using negative lookahead and dot: @@ -1220,11 +1207,11 @@ =item * A leading C<+> may also be supplied to indicate that the following -character class is to matched in a positive sense +character class is to matched in a positive sense. / <+[a..z_]>* / / <+[ a..z _ ]>* / - / <+[ a .. z _ ] >* / + / <+ [ a .. 
z _ ] >* / # whitespace allowed after + =item * @@ -1233,18 +1220,12 @@ / <[a..z] - [aeiou] + xdigit> / # consonant or hex digit -If such a combination starts with a named character class, a leading -C<+> is allowed but not required, provided the next character is a -character set operation: - - / <+alpha-[Jj]> / # J-less alpha - / <alpha-[Jj]> / # same thing - / <+alpha - [ Jj ]> / # still the same thing +A named character class may be used by itself: -However, whitespace is not allowed after the first identifier if it -immediately follows the left angle. + <alpha> - / <alpha - [Jj]> / # WRONG, means <alpha(/- [Jj]/)> +However, in order to combine classes you must prefix a named +character class with C<+> or C<->. =item * @@ -1278,8 +1259,8 @@ were not there. In addition to forcing zero-width, it also suppresses any named capture: - <alpha> # match a letter and capture in $<alpha> - <+alpha> # match a letter, don't capture + <alpha> # match a letter and capture to $alpha (eventually $<alpha>) + <.alpha> # match a letter, don't capture <?alpha> # match null before a letter, don't capture =item * @@ -1291,7 +1272,7 @@ <~~> # call myself recursively <~~0> # match according to $0's pattern - <~~foo> # match according to $<foo>'s rule + <~~foo> # match according to $foo's pattern Note that this rematches the pattern associated with the name, not the string matched. So @@ -1346,7 +1327,7 @@ match "C<foo>" backwards. The use of C<< <(...)> >> affects only the meaning of the I<result object> and the positions of the beginning and ending of the match. That is, after the match above, C<$()> contains -only the digits matched, and C<.pos> is pointing to after the digits. +only the digits matched, and C<$/.to> is pointing to after the digits. Other captures (named or numbered) are unaffected and may be accessed through C<$/>. @@ -1356,7 +1337,7 @@ A C<«> or C<<< << >>> token indicates a left word boundary. A C<»> or C<<< >> >>> token indicates a right word boundary. 
(As separate tokens, -these need not be balanced.) Perl 5's C<\b> is replaced by a C<< <+wb> >> +these need not be balanced.) Perl 5's C<\b> is replaced by a C<< <.wb> >> "word boundary" assertion, while C<\B> becomes C<< <!wb> >>. (None of these are dependent on the definition of C<< <ws> >>, but only on the C<\w> definition of "word" characters.) @@ -1768,31 +1749,36 @@ =item * -The null pattern is now illegal. +The empty pattern is now illegal. =item * To match whatever the prior successful regex matched, use: - /<prior>/ + / <prior> / =item * -To match the zero-width string, use: +To match the zero-width string, you must use some explicit +representation of the null match: - /<null>/ + / '' /; + / <?> /; For example: - split /<+null>/, $string + split /''/, $string + +splits between characters. But then, so does this: -splits between characters. + split '', $string =item * -To match a null alternative, use: +Likewise, to match a empty alternative, use something like: - /a|b|c|<+null>/ + /a|b|c|<?>/ + /a|b|c|''/ This makes it easier to catch errors like this: @@ -1828,7 +1814,8 @@ $something = ""; /a|b|c|$something/; -In particular, <?> also matches the null string, and <!> always fails. +In particular, C<< <?> >> always matches the null string successfuly, +and C<< <!> >> always fails to match anything. =back @@ -1887,7 +1874,7 @@ =item * -Any atom that is quantified with a minimally match (using the C<?> modifier). +Any atom that is quantified with a minimal match (using the C<?> modifier). =item * @@ -1915,9 +1902,15 @@ are simulated in any of various ways, such as by Thompson NFA, it may be possible to know when to fire off the assertions without backchecks.) -Greedy quantifiers and characters classes do not terminate a token pattern. +Greedy quantifiers and character classes do not terminate a token pattern. Zero-width assertions such as word boundaries are also okay. 
+For a pattern that starts with a positive lookahead assertion, +the assertion is assumed to be more specific than the subsequent +pattern, so the lookahead's pattern is treated as the longest token; +the longest-token matcher will be smart enough to rematch any text +traversed by the lookahead when (and if) it continues the match. + Oddly enough, the C<token> keyword specifically does not determine the scope of a token, except insofar as a token pattern usually doesn't do much matching of whitespace. In contrast, the C<rule> @@ -1959,9 +1952,11 @@ A match always returns a Match object, which is also available as C<$/>, which is a contextual lexical declared in the outer -subroutine that is calling the regex. (A closure lexically embedded -in a regex does not redeclare C<$/>, so C<$/> always refers to the -current match, not any prior submatch done within the closure). +subroutine that is calling the regex. (A regex declares its own +lexical C<$/> variable, which always refers to the most recent +submatch within the rule, if any.) The current match state is +kept in the regex's C<$_> variable which will eventually get +processed into the user's C<$/> variable when the match completes. =item * @@ -1991,9 +1986,9 @@ In string context it evaluates to the stringified value of its I<result object>, which is usually the entire matched string: - print %hash{ "{$text ~~ /<+ident>/}" }; + print %hash{ "{$text ~~ /<.ident>/}" }; # or equivalently: - $text ~~ /<+ident>/ && print %hash{~$/}; + $text ~~ /<.ident>/ && print %hash{~$/}; But generally you should say C<~$/> if you mean C<~$/>. @@ -2010,11 +2005,11 @@ When used as a scalar, a C<Match> object evaluates to its underlying result object. 
Usually this is just the entire match string, but -you can override that by calling C<return> inside a regex: +you can override that by calling C<reduce> inside a regex: my $moose = $(m:{ <antler> <body> - { return Moose.new( body => $<body>().attach($<antler>) ) } + { reduce Moose.new( body => $body().attach($antler) ) } # match succeeds -- ignore the rest of the regex }); @@ -2037,8 +2032,8 @@ This means that these two work the same: - / <moose> { return $$<moose> as Moose } / - / <moose> { return $<moose> as Moose } / + / <moose> { reduce $moose as Moose } / + / <moose> { reduce $$moose as Moose } / =item * @@ -2120,28 +2115,27 @@ =item * This returned object is also automatically assigned to the lexical -C<$/> variable, unless the match statement is inside another regex. That is: +C<$/> variable of the current surroundings. That is: $str ~~ /pattern/; say "Matched" if $/; =item * -Inside a regex, the C<$/> variable holds the current regex's -incomplete C<Match> object (which can be modified via the internal C<$/>). -For example: - - $str ~~ / foo # Match 'foo' - { $/ = 'bar' } # But pretend we matched 'bar' - /; - say $/; # says 'bar' - -This is slightly dangerous, insofar as you might return something that -does not behave like a C<Match> object to some context that requires -one. Fortunately, you normally just want to return a result object instead: +Inside a regex, the C<$_> variable holds the current regex's incomplete +C<Match> object, known as a match state. Generally this should not +be modified unless you know how to create and propagate match states. +All regexes actually return match states even when you think they're +returning something else, because the match states keep track of +the success and failures of the pattern for you. 
+ +Fortunately, when you just want to return a different result object instead +of the default C<Match> object, you may associate your return value with +the current match state using the C<reduce> function, which works something +like a C<return>, but doesn't clobber the match state: $str ~~ / foo # Match 'foo' - { return 'bar' } # But pretend we matched 'bar' + { reduce 'bar' } # But pretend we matched 'bar' /; say $(); # says 'bar' @@ -2459,10 +2453,10 @@ For example, this regex contains three subrules: - # subrule subrule subrule - # __^__ _______^______ __^__ - # | | | | | | - m/ <ident> $<spaces>:=(\s*) <digit>+ / + # subrule subrule subrule + # __^__ _______^_____ __^__ + # | | | | | | + m/ <ident> $spaces = (\s*) <digit>+ / =item * @@ -2503,8 +2497,8 @@ =item * Note that it makes no difference whether a subrule is angle-bracketed -(C<< <ident> >>) or aliased (C<< $<ident> := (<alpha>\w*) >>). The name's -the thing. +(C<< <ident> >>) or aliased internally (C<< <ident=name> >>) or aliased +externally (C<< $ident = (<alpha>\w*) >>). The name's the thing. =back @@ -2552,7 +2546,7 @@ then only the I<final> name counts when deciding whether it is or isn't repeated. For example: - if mm/ mv <file> $<dir>:=<file> / { + if mm/ mv <file> <dir=file> / { $from = $<file>; # Only one subrule named <file>, so scalar $to = $<dir>; # The Capture Formerly Known As <file> } @@ -2606,10 +2600,10 @@ If a named scalar alias is applied to a set of I<capturing> parens: - # ______/capturing parens\______ - # | | - # | | - mm/ $<key>:=( (<[A..E]>) (\d**{3..6}) (X?) ) /; + # ______/capturing parens\______ + # | | + # | | + mm/ $key = ( (<[A..E]>) (\d**{3..6}) (X?) ) /; then the outer capturing parens no longer capture into the array of C<$/> as unaliased parens would. 
Instead the aliased parens capture @@ -2664,10 +2658,10 @@ If a named scalar alias is applied to a set of I<non-capturing> brackets: - # ___/non-capturing brackets\___ - # | | - # | | - mm/ $<key>:=[ (<[A..E]>) (\d**{3..6}) (X?) ] /; + # ___/non-capturing brackets\___ + # | | + # | | + mm/ $key = [ (<[A..E]>) (\d**{3..6}) (X?) ] /; then the corresponding C<< $/<key> >> Match object contains only the string matched by the non-capturing brackets. @@ -2717,7 +2711,7 @@ entry whose key is the name of the alias. And it I<no longer> assigns anything to the hash entry whose key is the subrule name. That is: - if m/ ID\: $<id>:=<ident> / { + if m/ ID\: <id=ident> / { say "Identified as $/<id>"; # $/<ident> is undefined } @@ -2727,7 +2721,7 @@ object. This is particularly useful for differentiating two or more calls to the same subrule in the same scope. For example: - if mm/ mv <file>+ $<dir>:=<file> / { + if mm/ mv <file>+ <dir=file> / { @from = @($<file>); $to = $<dir>; } @@ -2742,7 +2736,7 @@ If a numbered alias is used instead of a named alias: - m/ $1:=(<-[:]>*) \: $0:=<ident> / + m/ $1=(<-[:]>*) \: $0=<ident> / the behavior is exactly the same as for a named alias (i.e. the various cases described above), except that the resulting C<Match> object is @@ -2756,9 +2750,9 @@ alias number (much like enum values increment from the last explicit value). That is: - # ---$1--- -$2- ---$6--- -$7- - # | | | | | | | | - m/ $1:=(food) (bard) $6:=(bazd) (quxd) /; + # --$1--- -$2- --$6--- -$7- + # | | | | | | | | + m/ $1=(food) (bard) $6=(bazd) (quxd) /; =item * @@ -2766,8 +2760,8 @@ Perl5 semantics for consecutive subpattern numbering in alternations: $tune_up = rx/ (don't) (ray) (me) (for) (solar tea), (d'oh!) - | $6:=(every) (green) (BEM) (devours) (faces) - # $7 $8 $9 $10 + | $6 = (every) (green) (BEM) (devours) (faces) + # $7 $8 $9 $10 /; =item * @@ -2794,12 +2788,12 @@ # Perl 6 simulating Perl 5... 
- # $1 - # ________________/\________________ - # | $2 $3 $4 | - # | ___/\___ ____/\____ /\ | - # | | | | | | | | - m/ $1:=[ (<[A..E]>) (\d**{3..6}) (X?) ] /; + # $1 + # _______________/\________________ + # | $2 $3 $4 | + # | ___/\___ ____/\____ /\ | + # | | | | | | | | + m/ $1=[ (<[A..E]>) (\d**{3..6}) (X?) ] /; The non-capturing brackets don't introduce a scope, so the subpatterns within them are at regex scope, and hence numbered at the top level. Aliasing the @@ -2832,7 +2826,7 @@ In other words, aliasing and quantification are completely orthogonal. For example: - if mm/ mv $0:=<file>+ / { + if mm/ mv $0=<file>+ / { # <file>+ returns a list of Match objects, # so $0 contains an array of Match objects, # one for each successful call to <file> @@ -2841,7 +2835,7 @@ } - if m/ mv \s+ $<from>:=(\S+ \s+)* / { + if m/ mv \s+ $from=(\S+ \s+)* / { # Quantified subpattern returns a list of Match objects, # so $/<from> contains an array of Match # objects, one for each successful match of the subpattern @@ -2857,7 +2851,7 @@ brackets (as described in L<Named scalar aliases applied to non-capturing brackets>). For example: - "coffee fifo fumble" ~~ m/ $<effs>:=[f <-[f]>**{1..2} \s*]+ /; + "coffee fifo fumble" ~~ m/ $effs = [f <-[f]>**{1..2} \s*]+ /; say $<effs>; # prints "fee fifo fum" @@ -2873,11 +2867,11 @@ An alias can also be specified using an array as the alias instead of a scalar. For example: - m/ mv \s+ @<from>:=[(\S+) \s+]* <dir> /; + m/ mv \s+ @from = [(\S+) \s+]* <dir> /; =item * -Using the C<< @<alias>:= >> notation instead of a C<< $<alias>:= >> +Using the C<< @alias= >> notation instead of a C<< $alias= >> mandates that the corresponding hash entry or array element I<always> receives an array of C<Match> objects, even if the construct being aliased would normally return a single C<Match> object. @@ -2885,11 +2879,11 @@ structurally different alternations (by enforcing array captures in all branches): - mm/ Mr?s? @<names>:=<ident> W\. 
@<names>:=<ident> - | Mr?s? @<names>:=<ident> + mm/ Mr?s? @names=<ident> W\. @names=<ident> + | Mr?s? @names=<ident> /; - # Aliasing to @<names> means $/<names> is always + # Aliasing to @names means $/<names> is always # an Array object, so... say @($/<names>); @@ -2899,8 +2893,8 @@ For convenience and consistency, C<< @<key> >> can also be used outside a regex, as a shorthand for C<< @( $/<key> ) >>. That is: - mm/ Mr?s? @<names>:=<ident> W\. @<names>:=<ident> - | Mr?s? @<names>:=<ident> + mm/ Mr?s? @names=<ident> W\. @names=<ident> + | Mr?s? @names=<ident> /; say @<names>; @@ -2911,18 +2905,18 @@ brackets, it captures the substrings matched by each repetition of the brackets into separate elements of the corresponding array. That is: - mm/ mv $<files>:=[ f.. \s* ]* /; # $/<files> assigned a single - # Match object containing the - # complete substring matched by - # the full set of repetitions - # of the non-capturing brackets - - mm/ mv @<files>:=[ f.. \s* ]* /; # $/<files> assigned an array, - # each element of which is a - # Match object containing - # the substring matched by Nth - # repetition of the non- - # capturing bracket match + mm/ mv $files=[ f.. \s* ]* /; # $/<files> assigned a single + # Match object containing the + # complete substring matched by + # the full set of repetitions + # of the non-capturing brackets + + mm/ mv @files=[ f.. \s* ]* /; # $/<files> assigned an array, + # each element of which is a + # Match object containing + # the substring matched by Nth + # repetition of the non- + # capturing bracket match =item * @@ -2933,7 +2927,7 @@ an array alias on a subpattern flattens and collects all nested subpattern captures within the aliased subpattern. For example: - if mm/ $<pairs>:=( (\w+) \: (\N+) )+ / { + if mm/ $pairs=( (\w+) \: (\N+) )+ / { # Scalar alias, so $/<pairs> is assigned an array # of Match objects, each of which has its own array # of two subcaptures... 
@@ -2945,7 +2939,7 @@ } - if mm/ @<pairs>:=( (\w+) \: (\N+) )+ / { + if mm/ @pairs=( (\w+) \: (\N+) )+ / { # Array alias, so $/<pairs> is assigned an array # of Match objects, each of which is flattened out of # the two subcaptures within the subpattern @@ -2965,7 +2959,7 @@ rule pair { (\w+) \: (\N+) \n } - if mm/ $<pairs>:=<pair>+ / { + if mm/ $pairs=<pair>+ / { # Scalar alias, so $/<pairs> contains an array of # Match objects, each of which is the result of the # <pair> subrule call... @@ -2977,7 +2971,7 @@ } - if mm/ mv @<pairs>:=<pair>+ / { + if mm/ mv @pairs=<pair>+ / { # Array alias, so $/<pairs> contains an array of # Match objects, all flattened down from the # nested arrays inside the Match objects returned @@ -3004,13 +2998,13 @@ appropriate element of the regex's match array rather than to a key of its match hash. For example: - if m/ mv \s+ @0:=((\w+) \s+)+ $1:=((\W+) (\s*))* / { - # | | - # | | - # | \_ Scalar alias, so $1 gets an - # | array, with each element - # | a Match object containing - # | the two nested captures + if m/ mv \s+ @0=((\w+) \s+)+ $1=((\W+) (\s*))* / { + # | | + # | | + # | \_ Scalar alias, so $1 gets an + # | array, with each element + # | a Match object containing + # | the two nested captures # | # \___ Array alias, so $0 gets a flattened array of # just the (\w+) captures from each repetition @@ -3040,7 +3034,7 @@ An alias can also be specified using a hash as the alias variable, instead of a scalar or an array. 
For example: - m/ mv %<location>:=( (<ident>) \: (\N+) )+ /; + m/ mv %location=( (<ident>) \: (\N+) )+ /; =item * @@ -3062,7 +3056,7 @@ rule one_to_many { (\w+) \: (\S+) (\S+) (\S+) } - if mm/ %0:=<one_to_many>+ / { + if mm/ %0=<one_to_many>+ / { # $/[0] contains a hash, in which each key is provided by # the first subcapture within C<one_to_many>, and each # value is an array containing the @@ -3094,11 +3088,11 @@ Instead of using internal aliases like: - m/ mv @<files>:=<ident>+ $<dir>:=<ident> / + m/ mv @files=<ident>+ $dir=<ident> / the name of an ordinary variable can be used as an I<external> alias, like so: - m/ mv @files:=<ident>+ $dir:=<ident> / + m/ mv @files=<ident>+ $dir=<ident> / =item * @@ -3185,10 +3179,10 @@ the angles is used as part of the key. Suppose the earlier example parsed whitespace: - / <key> <+ws> '=>' <+ws> <value> { %hash{$<key>} = $<value> } / + / <key> <.ws> '=>' <.ws> <value> { %hash{$key} = $value } / -The two instances of C<< <+ws> >> above would store an array of two -values accessible as C<< @<+ws> >>. It would also store the literal +The two instances of C<< <.ws> >> above would store an array of two +values accessible as C<< @<.ws> >>. It would also store the literal match into C<< $<'=\>'> >>. Just to make sure nothing is forgotten, under C<:keepall> any text or whitespace not otherwise remembered is attached as an extra property on the subsequent node. (The name of @@ -3251,20 +3245,20 @@ grammar Letter { rule text { <greet> <body> <close> } - rule greet { [Hi|Hey|Yo] $<to>:=(\S+?) , $$} + rule greet { [Hi|Hey|Yo] $to=(\S+?) , $$} rule body { <line>+? } # note: backtracks forwards via +? - rule close { Later dude, $<from>:=(.+) } + rule close { Later dude, $from=(.+) } # etc. } grammar FormalLetter is Letter { - rule greet { Dear $<to>:=(\S+?) , $$} + rule greet { Dear $to=(\S+?) , $$} - rule close { Yours sincerely, $<from>:=(.+) } + rule close { Yours sincerely, $from=(.+) } }