[svn:perl6-synopsis] r13557 - doc/trunk/design/syn

larry Wed, 31 Jan 2007 09:38:00 -0800

Author: larry
Date: Wed Jan 31 10:37:46 2007
New Revision: 13557

Modified:
   doc/trunk/design/syn/S05.pod


Log:
Clean up some of the language to avoid confusing automata terminology.
Changed token keyword not to terminate token autodeclaration on whitespace,
so it's now possible to specify a token containing whitespace.


Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod        (original)
+++ doc/trunk/design/syn/S05.pod        Wed Jan 31 10:37:46 2007
@@ -14,9 +14,9 @@
    Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
                Larry Wall <[EMAIL PROTECTED]>
    Date: 24 Jun 2002
-   Last Modified: 27 Jan 2007
+   Last Modified: 31 Jan 2007
    Number: 5
-   Version: 47
+   Version: 48
 
 This document summarizes Apocalypse 5, which is about the new regex
 syntax.  We now try to call them I<regex> rather than "regular
@@ -68,7 +68,9 @@
 =back
 
 While the syntax of C<|> does not change, the default semantics do
-change slightly.  See the section below on "Longest-token matching".
+change slightly.  We are attempting to concoct a pleasing mixture
+of declarative and procedural matching so that we can have the
+best of both.  See the section below on "Longest-token matching".
 
 =head1 Modifiers
 
@@ -1519,14 +1521,14 @@
 =head1 Longest-token matching
 
 Instead of representing temporal alternation, C<|> now represents
-logical alternation with longest-token semantics.  (You may now use
-C<||> to indicate the old temporal alternation.  That is, C<|> and
-C<||> now work within regex syntax much the same as they do outside
+logical alternation with declarative longest-token semantics.  (You may
+now use C<||> to indicate the old temporal alternation.  That is, C<|>
+and C<||> now work within regex syntax much the same as they do outside
 of regex syntax, where they represent junctional and short-circuit OR.
 This includes the fact that C<|> has tighter precedence than C<||>.)
 
-Historically regex processing has proceeded in Perl via an NFA
-algorithm.  This is quite powerful, but many parsers work more
+Historically regex processing has proceeded in Perl via a backtracking
+NFA algorithm.  This is quite powerful, but many parsers work more
 efficiently by processing rules in parallel rather than one after
 another, at least up to a point.  If you look at something like a
 yacc grammar, you find a lot of pattern/action declarations where the
@@ -1544,7 +1546,8 @@
 list of initial token patterns (transitively including the token
 patterns of any subrule called by the "pure" part of that regex, but
 not including any subrule more than once, since that would involve
-self reference).  A logical alternation using C<|> then takes two or
+self reference, which is not allowed in traditional regular
+expressions).  A logical alternation using C<|> then takes two or
 more of these lists and dispatches to the alternative that matches
 the longest token prefix.  This may or may not be the alternative
 that comes first lexically.  (However, in the case of a tie between
@@ -1559,7 +1562,7 @@
 it is finished with the pattern part and starting in on the side effects,
 so by inserting such constructs the user controls what is considered
 a token and what is not.  The constructs deemed to terminate a token
-pattern and start the "action" part of the pattern include:
+declaration and start the "action" part of the pattern include:
 
 =over
 
@@ -1574,8 +1577,10 @@
 
 =item *
 
-Any part of the regex that might match whitespace, including whitespace
-implicitly matched via C<:sigspace>.
+Any part of the regex or rule that I<might> match whitespace,
+including whitespace implicitly matched via C<:sigspace>.  (However,
+token declarations are specifically allowed to recognize whitespace
+within a token.)
 
 =item *
 
@@ -1584,17 +1589,18 @@
 =back
 
 Subpatterns (captures) specifically do not terminate the token pattern,
-but may require a reparse of the token via NFA to find the location
+but may require a reparse of the token to find the location
 of the subpatterns.  Likewise assertions may need to be checked out
-after the longest token is determined.  (Alternately DFA semantics
-may be simulated in any of various ways.)
+after the longest token is determined.  (Alternately, if DFA semantics
+are simulated in any of various ways, such as by Thompson NFA, it may
+be possible to know when to fire off the assertions without backchecks.)
 
 Ordinary quantifiers and characters classes do not terminate a token pattern.
 Zero-width assertions such as word boundaries also okay.
 
 Oddly enough, the C<token> keyword specifically does not determine
 the scope of a token, except insofar as a token pattern usually
-doesn't do any matching of whitespace.  In contrast, the C<rule>
+doesn't do much matching of whitespace.  In contrast, the C<rule>
 keyword (which assumes C<:sigspace>) defines a pattern that tends
 to disqualify itself on the first whitespace.  So most of the token
 patterns will end up coming from C<token> declarations.  For instance,
@@ -1605,11 +1611,11 @@
 considers its "longest token" to be just the left square bracket, because
 the first thing the C<expr> rule will do is traverse optional whitespace.
 
-Initial tokens must take into account case sensitivity (or any other
-canonicalization primitives) and do the right thing even when propagated
-up to rules that don't have the same canonicalization.  That is, they
-must continue to represent the set of matches that the lower rule would
-match.
+The initial token matcher must take into account case sensitivity
+(or any other canonicalization primitives) and do the right thing even
+when propagated up to rules that don't have the same canonicalization.
+That is, they must continue to represent the set of matches that the
+lower rule would match.
 
 The C<||> form has the old short-circuit semantics, and will not
 attempt to match its right side unless all possibilities (including
@@ -1618,9 +1624,9 @@
 outer longest-token matcher, but hides any subsequent tests from
 longest-token matching.  Every C<||> establishes a new longest-token
 matcher.  That is, if you use C<|> on the right side of C<||>, that
-right side establishes a new top level DFA for longest-token processing
+right side establishes a new top level scope for longest-token processing
 for this subexpression and any called subrules.  The right side's
-longest-token list is invisible to the left of the C<||> or outside
+longest-token automaton is invisible to the left of the C<||> or outside
 the regex containing the C<||>.
 
 =head1 Return values from matches
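
Note on the hunks above: the difference between C<|> (longest-token
alternation) and C<||> (the old ordered, short-circuit alternation) can
be illustrated with a toy sketch. This is not how Perl 6 implements
either operator; it is a minimal Python model that assumes every
alternative is a plain literal string, and the function names are
hypothetical. Tie-breaking between equally long matches is not modeled.

```python
# Toy model only: alternatives are literal strings, not real token patterns.
# Function names are hypothetical and chosen for illustration.

def ordered_alternation(alternatives, text):
    """Like `||`: try alternatives left to right, return the first that matches."""
    for alt in alternatives:
        if text.startswith(alt):
            return alt
    return None


def longest_token_alternation(alternatives, text):
    """Like `|`: consider every alternative, return the longest matching prefix."""
    matches = [alt for alt in alternatives if text.startswith(alt)]
    # Tie-breaking is not modeled; max() simply keeps the first longest match.
    return max(matches, key=len, default=None)


alts = ["for", "foreach"]
print(ordered_alternation(alts, "foreach $x"))        # for
print(longest_token_alternation(alts, "foreach $x"))  # foreach
```

With the alternatives listed as "for" then "foreach", the ordered form
stops at "for", while the longest-token form picks "foreach" even though
it does not come first lexically.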
