Author: larry
Date: Wed Jan 31 10:37:46 2007
New Revision: 13557

Modified:
   doc/trunk/design/syn/S05.pod

Log:
Clean up some of the language to avoid confusing automata terminology.
Changed token keyword not to terminate token autodeclaration on whitespace,
so it's now possible to specify a token containing whitespace.

Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod (original)
+++ doc/trunk/design/syn/S05.pod Wed Jan 31 10:37:46 2007
@@ -14,9 +14,9 @@
    Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
                Larry Wall <[EMAIL PROTECTED]>
    Date: 24 Jun 2002
-   Last Modified: 27 Jan 2007
+   Last Modified: 31 Jan 2007
    Number: 5
-   Version: 47
+   Version: 48
 
 This document summarizes Apocalypse 5, which is about the new regex
 syntax. We now try to call them I<regex> rather than "regular
@@ -68,7 +68,9 @@
 =back
 
 While the syntax of C<|> does not change, the default semantics do
-change slightly. See the section below on "Longest-token matching".
+change slightly. We are attempting to concoct a pleasing mixture
+of declarative and procedural matching so that we can have the
+best of both. See the section below on "Longest-token matching".
 
 =head1 Modifiers
 
@@ -1519,14 +1521,14 @@
 =head1 Longest-token matching
 
 Instead of representing temporal alternation, C<|> now represents
-logical alternation with longest-token semantics. (You may now use
-C<||> to indicate the old temporal alternation. That is, C<|> and
-C<||> now work within regex syntax much the same as they do outside
+logical alternation with declarative longest-token semantics. (You may
+now use C<||> to indicate the old temporal alternation. That is, C<|>
+and C<||> now work within regex syntax much the same as they do outside
 of regex syntax, where they represent junctional and short-circuit OR.
 This includes the fact that C<|> has tighter precedence than C<||>.)
 
-Historically regex processing has proceeded in Perl via an NFA
-algorithm. This is quite powerful, but many parsers work more
+Historically regex processing has proceeded in Perl via a backtracking
+NFA algorithm. This is quite powerful, but many parsers work more
 efficiently by processing rules in parallel rather than one after
 another, at least up to a point. If you look at something like a
 yacc grammar, you find a lot of pattern/action declarations where the
@@ -1544,7 +1546,8 @@
 list of initial token patterns (transitively including the token
 patterns of any subrule called by the "pure" part of that regex, but
 not including any subrule more than once, since that would involve
-self reference). A logical alternation using C<|> then takes two or
+self reference, which is not allowed in traditional regular
+expressions). A logical alternation using C<|> then takes two or
 more of these lists and dispatches to the alternative that matches
 the longest token prefix. This may or may not be the alternative
 that comes first lexically. (However, in the case of a tie between
@@ -1559,7 +1562,7 @@
 it is finished with the pattern part and starting in on the side effects,
 so by inserting such constructs the user controls what is considered
 a token and what is not. The constructs deemed to terminate a token
-pattern and start the "action" part of the pattern include:
+declaration and start the "action" part of the pattern include:
 
 =over
 
@@ -1574,8 +1577,10 @@
 
 =item *
 
-Any part of the regex that might match whitespace, including whitespace
-implicitly matched via C<:sigspace>.
+Any part of the regex or rule that I<might> match whitespace,
+including whitespace implicitly matched via C<:sigspace>. (However,
+token declarations are specifically allowed to recognize whitespace
+within a token.)
 
 =item *
 
@@ -1584,17 +1589,18 @@
 =back
 
 Subpatterns (captures) specifically do not terminate the token pattern,
-but may require a reparse of the token via NFA to find the location
+but may require a reparse of the token to find the location
 of the subpatterns. Likewise assertions may need to be checked out
-after the longest token is determined. (Alternately DFA semantics
-may be simulated in any of various ways.)
+after the longest token is determined. (Alternately, if DFA semantics
+are simulated in any of various ways, such as by Thompson NFA, it may
+be possible to know when to fire off the assertions without backchecks.)
 
 Ordinary quantifiers and characters classes do not terminate a token
 pattern. Zero-width assertions such as word boundaries also okay.
 
 Oddly enough, the C<token> keyword specifically does not determine
 the scope of a token, except insofar as a token pattern usually
-doesn't do any matching of whitespace. In contrast, the C<rule>
+doesn't do much matching of whitespace. In contrast, the C<rule>
 keyword (which assumes C<:sigspace>) defines a pattern that tends
 to disqualify itself on the first whitespace. So most of the token
 patterns will end up coming from C<token> declarations. For instance,
@@ -1605,11 +1611,11 @@
 considers its "longest token" to be just the left square bracket, because
 the first thing the C<expr> rule will do is traverse optional whitespace.
 
-Initial tokens must take into account case sensitivity (or any other
-canonicalization primitives) and do the right thing even when propagated
-up to rules that don't have the same canonicalization. That is, they
-must continue to represent the set of matches that the lower rule would
-match.
+The initial token matcher must take into account case sensitivity
+(or any other canonicalization primitives) and do the right thing even
+when propagated up to rules that don't have the same canonicalization.
+That is, they must continue to represent the set of matches that the
+lower rule would match.
 
 The C<||> form has the old short-circuit semantics, and will not
 attempt to match its right side unless all possibilities (including
@@ -1618,9 +1624,9 @@
 outer longest-token matcher, but hides any subsequent tests from
 longest-token matching. Every C<||> establishes a new longest-token
 matcher. That is, if you use C<|> on the right side of C<||>, that
-right side establishes a new top level DFA for longest-token processing
+right side establishes a new top level scope for longest-token processing
 for this subexpression and any called subrules. The right side's
-longest-token list is invisible to the left of the C<||> or outside
+longest-token automaton is invisible to the left of the C<||> or outside
 the regex containing the C<||>.
 
 =head1 Return values from matches
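
For readers skimming the diff, the distinction the "Longest-token matching" hunks draw between C<|> and C<||> can be sketched roughly as below. This is a Python illustration under stated assumptions, not how any Perl 6 implementation works: the names ordered_alt, longest_alt and ALTS are hypothetical, each alternative is treated as a single token pattern (no subrules, no action part), and breaking ties in favor of the earlier alternative is an assumption of the sketch, since the quoted context is cut off before the tie rule.

```python
import re

def ordered_alt(alternatives, text):
    """Old temporal alternation (the '||' form): try each alternative
    in order and stop at the first one that matches."""
    for name, pattern in alternatives:
        m = re.match(pattern, text)
        if m:
            return name, m.group(0)
    return None

def longest_alt(alternatives, text):
    """Longest-token alternation (the '|' form): try every alternative
    and keep the one with the longest match at the current position.
    Assumption: on a tie, the earlier alternative is kept."""
    best = None
    for name, pattern in alternatives:
        m = re.match(pattern, text)
        # Strict '>' means a later alternative only wins if it is longer.
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best

# Hypothetical token patterns: a keyword listed before a general identifier.
ALTS = [("keyword_if", r"if"), ("identifier", r"[a-z]+")]

print(ordered_alt(ALTS, "iffy = 1"))  # first match wins: the keyword
print(longest_alt(ALTS, "iffy = 1"))  # longest match wins: the identifier
print(longest_alt(ALTS, "if (x)"))    # tie on length: earlier alternative kept
```

With ordered alternation the keyword pattern shadows the identifier whenever it is listed first, while the longest-token form picks "iffy" regardless of the order in which the alternatives are written.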