Re: Regular expressions and closures (was: Perl 6 Summary for 2004-10-01 through 2004-10-17)

Aaron Sherman Tue, 26 Oct 2004 19:26:47 -0700

On Tue, 2004-10-26 at 20:16, Larry Wall wrote:
> On Tue, Oct 26, 2004 at 01:42:02PM -0400, Aaron Sherman wrote:


> :     /ab(c|b){$1 eq 'c'}/
> : 
> : If I recall correctly you had said something like, "there is no plan
> : (yet) to allow embedded closures to affect matching directly, other than
> : failing outright." This implied (to me) that you were thinking along
> : those lines, but were not sure if the regex engine would support it.
> 
> I think you might have misunderstood me slightly.  I was primarily
> quibbling about your syntax, insofar as the above won't work because
> the syntax of a bare closure indicates that the return value is to
> be ignored, so it's normally used *only* for its side effects (in the
> absence of a failure exception).

Ok, that makes sense.

>   The current syntax for what you're
> trying to write is:
> 
>       /ab(c|b) <($1 eq 'c')>/
> 
> which is equivalent to
> 
>       /ab(c|b) {fail unless $1 eq 'c'}/

Now, what does "fail" mean? I can think of two definitions:

        1. proceed to trap state (backtracking then happens)
        2. exit (probably using an exception) the rule?

> : My work so far, and discussions with friends suggest that my scheme for
> : building an FSA with attached continuations will work as long as I place
> : some heavy restrictions on the behavior and side-effects of the closure.
> 
> I suppose the usefulness of that will depend on what you mean by
> "heavy restrictions".  But we have to have general side effects or
> we can't build up our syntax tree.

I'm sorry. By that I mean that you will not be allowed to change the
structure or behavior of the FSA in ways which are not consistent with
what was known before execution began. That is, you can't say "goto
state 10", you can only say, "pretend you just matched an 'a'". Now, if
you're thinking in terms of a simple regex that sounds strange, but if
you're thinking in terms of building an FSA that goes to state 10 on
matching a "zero-width 'a'", then you've just gotten the same value out
of the closure as if it *could* "goto state 10". Sort of the "all's fair
if you predeclare" approach to regular expressions and FSAs in general.

> :  State 1:
> :             match(token=>'a', target => 2),
> :             match(token=>!'a', target => trap)
> 
> Does this notation imply that you're matching against 'a' twice?

No, it implies that there are two comparisons to be done, one for the
set 'a' and one for the set !'a'. My formal set notation skills suck,
pardon the poor clarity.

Since the back-end is pluggable, you can't say how that will happen, but
the default back-end will test each set serially until it finds a match.

> :  State 3 (start subexp mark, end subexp mark):
> :             match(token => 'c', target => 4),
> :             match(token => 'b', target => 4),
> :             match(token => !['c','b'], target => trap)
> 
> I find this notation confusing.  In Perl 5 regex, the start and end are
> considered separate states, and the end state would follow the matches,
> or you wouldn't capture the "motion", and $1 would end up null.  But
> then, I think in NFA, not DFA...

NDFAs (or NFAs as you say) should be the same as DFAs for this purpose.
I think it's just a difference in the way we look at the token-matching
process. I see it as part of the state, so to say "we both begin and end
subexpression storage here" is to say, "only the token matched here will
be stored as the subexpression." This restricts you, though, and I may
find that in order to say:

        (a)((b)c)

I need too much hair and want to put the markers elsewhere.... I'll know
that by next week.

> :  State 4 (closure {$1 eq 'c'}):
> :             match(token => true, target => 5)
> :             match(token => false, target => trap)
> 
> It's not the regex that would be evaluating for truth here.  It only needs
> to make sure it can trap a failure.

The way it will trap failure will be to check the return value. If that
means that the continuation that it stores is a standard wrapper that
invokes the Perl-level closure and catches exceptions or some other
mechanism, that's fine with me, and I consider that a compiler issue to
be addressed later.

I'm pretty sure I want the return value to be an artificial token, as
the input is the only state you can change without affecting the
integrity of the FSA. (this is where finite state purists come hunting
for me with pitchforks and torches)

> :     start subexp mark: boolean: start saving with this token
> :     end subexp mark: boolean: stop saving after this token
> :     closure: continuation: sub(RXContext $cx) returns(RXToken)
> 
> We also have the <{ $subregex }> form that has to be able to evaluate
> a rule returned by a closure.

I think evaluating a sub-rule will be a fairly simple matter, and in
that particular case, the continuation stored in with the state would
wrap the sub-rule invocation and manage it for you. I have not stopped
to think about that too much, so I might be missing something big. For
example, I've not tried to think in terms of what style of recursion p6
rules will have to follow, but I'm 90% convinced that no matter how I
think about it, a solution can be engineered in the compiler given the
tools that I'm providing.

Again, though, that's mostly a compiler issue, and I don't want to get
too far into that (I'll probably post a long bit to p6c once I get the
core engine in place).

> : For a state to be equivalent to another state, it must satisfy:
> : 
> :     state1.terminal == state2.terminal
> :     state1.start_mark == state2.start_mark
> :     state1.end_mark == state2.end_mark
> :     state1.closure == state2.closure
> : 
> : and, of course, the normal recursive comparison of all transitions and
> : the states they target.
> 
> I think you also have to consider *which* start and end you're saving here.
> Don't want to mix up your paren counts...

I don't think I do. It *will* be wasteful to have empty states just to
open and/or close subexpressions, but only in terms of the FSA
representation, not the generated code (which presumably would optimize
away the empty states where needed).

By empty I mean:

        State 1 (start subexp mark):
                match(token => all, target => 2, consume => false)
        State 2 (start subexp mark):
                match(token => all, target => 3, consume => false)
        and so-on...

> Also have to allow saves to named variables.

I think that will require some more thought.... I'm not sure yet how to
deal with that case, but it will have to be tightly tied to numbered
subexpressions somehow.... hmmm.

> : This allows for transformation and minimization of the FSA as well as
> : arbitrary code execution. The obvious drawback is something that you've
> : pointed out several times: the code may be executed many times, not at
> : all or may be executed and then backtracked over (and then re-executed
> : or not). This makes side-effects dangerous and unpredictable, but I
> : think that's OK.
> 
> What I have said is that the ordinary side-effect form (bare closure)
> must run at the canonical time for its location in the regex, even
> if that disables certain optimizations.  The conditional form and
> the regex returning form may be optimized away.

I may not have followed you there. Are you saying that:

        "abc" ~~ rx/ab{print "Hello"}b | ab{print "World"}c/

would print "Hello", "World", or "HelloWorld"? Or are you just referring
to the fact that although the NFA view of the world tells us that it
could say, "WorldHelloWorldHelloWorld", it will not?

> Um, that's what UNDO already is supposed to do.  Closures embedded
> in rules logically return when backtracked over, and UNDO is already
> set up to roll back on failure.

Great... so I just have to have some vehicle to allow for that... I'll
be right with you ;-)

> Perl allows you to return only
> 
>     success (void context)
>     failure exception
>     subrule to match

I'm going to think about this bit some more too... I think I'm not yet
fully following, but that's because I'm slow ;-)

> : but the FSA itself will allow a closure to
> : "match" any token (as if it had been in the input stream) by returning
> : it (the token is consumed, but is translated to the empty string for
> : purposes of sub-expressions and other "memory").
> 
> Matching one token seems relatively useless to me.  A closure typically
> wants to match either 0 tokens or a sequence of tokens.

Hmm... I think we're saying different things.

I'm talking about how the closure communicates its desires to the FSA,
not how it thinks about the input. I don't think this is useful to
Perl-like regular expression languages, but it's a superset of what is
needed for any sort of flow control from inside a closure without
destroying the nature of the FSA (e.g. that it has a fixed and finite
set of states and transitions). It also allows for minimal special-cases
for Perl 6. Ideally I'd like Perl 6 to simply be a special-case of the
generic finite-pattern system I'm building, not the other way around.

> : Its important to keep
> : this separate in concept from actually causing the FSA to transition to
> : a target state (which the closure cannot do).
> 
> The closure must be able to change the current "pos" at least.  I don't
> think it necessarily has to be able to pick the next state.  But it does
> have to have access to the current $0 match state object.

Hrm... I'm not sure what changing the current "pos" would mean... I can
see forcing backtracking, but that's implicit in failure (failure is the
wrong way to think of it... it's really injecting a zero-width character
that the existing FSA just happens to transition to a trap state on).

> Be sure to keep in touch with Patrick and Luke, who are coming at this
> from the direction of the Perl 6 parser.

As soon as I have a reference implementation that just does my simple
regexp syntax and has a simple, if stupid, back-end then I think I can
really help out the various compiler camps. Right now, I have no idea if
I have anything to offer (or if, for example, Perl 6 would be better off
rolling its own)... all of which is why I'm writing to p6l right now. I
consider this all a theoretical discussion of the value of my approach
visa vis Perl 6's definition of a rule.

-- 
â 781-324-3772
â [EMAIL PROTECTED]
â http://www.ajs.com/~ajs

Re: Regular expressions and closures (was: Perl 6 Summary for 2004-10-01 through 2004-10-17)

Reply via email to