Author: larry
Date: Wed Mar 19 09:39:02 2008
New Revision: 14525
Modified:
doc/trunk/design/syn/S05.pod
Log:
Add <*abc> form for sequential optional characters
Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod (original)
+++ doc/trunk/design/syn/S05.pod Wed Mar 19 09:39:02 2008
@@ -14,9 +14,9 @@
Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
Larry Wall <[EMAIL PROTECTED]>
Date: 24 Jun 2002
- Last Modified: 17 Mar 2008
+ Last Modified: 19 Mar 2008
Number: 5
- Version: 74
+ Version: 75
This document summarizes Apocalypse 5, which is about the new regex
syntax. We now try to call them I<regex> rather than "regular
@@ -1145,32 +1145,6 @@
=item *
-The special named assertions include:
-
- / <?before pattern> / # lookahead
- / <?after pattern> / # lookbehind
-
- / <?same> / # true between two identical characters
-
- / <.ws> / # match "whitespace":
- # \s+ if it's between two \w characters,
- # \s* otherwise
-
- / <?at($pos)> / # match only at a particular StrPos
- # short for <?{ .pos === $pos }>
- # (considered declarative until $pos changes)
-
-The C<after> assertion implements lookbehind by reversing the syntax
-tree and looking for things in the opposite order going to the left.
-It is illegal to do lookbehind on a pattern that cannot be reversed.
-
-Note: the effect of a forward-scanning lookbehind at the top level
-can be achieved with:
-
- / .*? prestuff <( mainpat )> /
-
-=item *
-
A leading C<.> causes a named assertion not to capture what it matches (see
L<Subrule captures>). For example:
@@ -1225,7 +1199,8 @@
This assertion is not automatically captured.
As with bare hash, the longest key matches according to the venerable
-I<longest-token rule>.
+I<longest-token rule>. [Conjecture: <%foo> may not be supported in 6.0, or
+may be retargeted to matching an abbreviation table.]
=item *
@@ -1366,6 +1341,90 @@
<.alpha> # match a letter, don't capture
<?alpha> # match null before a letter, don't capture
+The special named assertions include:
+
+ / <?before pattern> / # lookahead
+ / <?after pattern> / # lookbehind
+
+ / <?same> / # true between two identical characters
+
+ / <.ws> / # match "whitespace":
+ # \s+ if it's between two \w characters,
+ # \s* otherwise
+
+ / <?at($pos)> / # match only at a particular StrPos
+ # short for <?{ .pos === $pos }>
+ # (considered declarative until $pos changes)
+
+The C<after> assertion implements lookbehind by reversing the syntax
+tree and looking for things in the opposite order going to the left.
+It is illegal to do lookbehind on a pattern that cannot be reversed.
+
+Note: the effect of a forward-scanning lookbehind at the top level
+can be achieved with:
+
+ / .*? prestuff <( mainpat )> /
+
+=item *
+
+A leading C<*> indicates that the following pattern allows a
+partial match. It always succeeds after matching as many characters
+as possible. (It is not zero-width unless 0 characters match.)
+For instance, to match a number of abbreviations, you might write
+any of:
+
+ s/ ^ G<*n|enesis> $ /gen/ or
+ s/ ^ Ex<*odus> $ /ex/ or
+ s/ ^ L<*v|eviticus> $ /lev/ or
+ s/ ^ N<*m|umbers> $ /num/ or
+ s/ ^ D<*t|euteronomy> $ /deut/ or
+ ...
+
+ / (<* <foo bar baz> >) /
+
+ / <[EMAIL PROTECTED]> / and return %long{$<short>} || $<short>;
+
+The pattern is restricted to declarative forms that can be rewritten
+as nested optional character matches. Sequence information
+may not be discarded while making all following characters optional.
+That is, it is not sufficient to rewrite:
+
+ <*xyz>
+
+as:
+
+ x? y? z? # bad, would allow xz
+
+Instead, it must be implemented as:
+
+ [x [y z?]?]? # allow only x, xy, xyz (and '')
+
+Explicit quantifiers are allowed on single characters, so this:
+
+ <* a b+ c | ax*>
+
+is rewritten as something like:
+
+ [a [b+ c?]?]? | [a x*]?
+
+In the latter example we're assuming the DFA token matcher is going to
+give us the longest match regardless. It's also possible that quantified
+multichar sequences can be recursively remapped:
+
+ <* 'ab'+> # match a, ab, ababa, etc. (but not aab!)
+ ==> [ 'ab'* <*ab> ]
+ ==> [ 'ab'* [a b?]? ]
+
+[Conjecture: depending on how fancy we get, we might (or might not)
+be able to autodetect ambiguities in C<< <[EMAIL PROTECTED]> >> and refuse to
+generate ambiguous abbreviations (although exact match of a shorter
+abbrev should always be allowed even if it's the prefix of a longer
+abbreviation). If it is not possible, then the user will have to
+check for ambiguities after the match. Note also that the array
+form is assuming the array doesn't change often. If it does, the
+longest-token matcher has to be recalculated, which could get
+expensive.]
+
=item *
A leading C<~~> indicates a recursive call back into some or all of
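
The nested-optional rewrite described for <*xyz> in this revision can be
tried out in an ordinary regex engine. The Python below is a minimal sketch
under stated assumptions, not the S05 implementation: the names
partial_match_pattern and genesis are hypothetical, and only plain character
sequences are handled (no quantifiers inside the pattern, and no
longest-token matcher).

```python
import re


def partial_match_pattern(chars: str) -> str:
    """Return a regex source string that accepts every prefix of chars.

    Hypothetical helper: mirrors the rewrite of <*xyz> into
    [x [y z?]?]? by nesting one optional group per character.
    """
    pattern = ""
    # Build from the last character outward, so that each optional
    # group wraps everything that is allowed to follow it.
    for ch in reversed(chars):
        pattern = f"(?:{re.escape(ch)}{pattern})?"
    return pattern


nested = re.compile(partial_match_pattern("xyz"))
naive = re.compile("x?y?z?")  # the rewrite the text calls "bad"

for candidate in ("", "x", "xy", "xyz", "xz", "yz"):
    print(
        repr(candidate),
        "nested:", nested.fullmatch(candidate) is not None,
        "naive:", naive.fullmatch(candidate) is not None,
    )

# Rough stand-in for G<*n|enesis>: each alternative is expanded separately.
genesis = re.compile(
    "G(?:" + "|".join(partial_match_pattern(a) for a in ("n", "enesis")) + ")"
)
```

The nested form rejects "xz" and "yz" because a later character is only
reachable through the group of the character before it, which is the sequence
information the naive x? y? z? form throws away.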