On Wed, 28 Aug 2002, Steve Fink wrote:

> On Wed, Aug 28, 2002 at 12:55:44PM -0400, Deven T. Corzine wrote:
> > On Wed, 28 Aug 2002, Dan Sugalski wrote:
> > > At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote:
> > 
> > On the other hand, :, ::, ::: and <commit> don't necessarily need to be a 
> > problem if they can be treated as hints that can be ignored.  If allowing 
> > the normal engine to backtrack despite the hints would change the results, 
> > that might be a problem.  I don't know; <cut> may pose special problems.
> 
> They do change the semantics.
> 
>  (June|Jun) ebug
> 
> matches the string "Junebug", but
> 
>  (June|Jun): ebug
> 
> does not. Similarly,
> 
>  (June|Jun) ebug
>  (June|Jun) :: ebug
>  (June|Jun) ::: ebug
>  (June|Jun) <commit> ebug
> 
> all behave differently when embedded in a larger grammar.

Good point.  Okay, they definitely change the semantics.  Still, could such 
semantics be implemented in a non-backtracking state machine, whether or 
not it's a strict DFA?

> However, it is very possible that in many (the majority?) of actual
> uses, they may be intended purely as optimizations and so any
> differing behavior is unintentional. It may be worth having a flag
> noting that (maybe combined with a warning "you said this :: isn't
> supposed to change what can match, but it does.")

I think this is like the leftmost matching semantics -- it may exist for 
the sake of implementation efficiency, yet it has semantic consequences as 
well.  In many cases, those semantic differences may be immaterial, yet 
some code relies on it.  Allowing flags to specify that such differences in
semantics are immaterial to your pattern would be helpful.  (Would it make 
sense for one flag to say "don't care" to the semantic differences for BOTH 
leftmost matching and the :/::/:::/etc. operators?)

> > I believe there are many subpatterns which might be beneficial to compile 
> > to a DFA (or DFA-like) form, where runtime performance is important.  For 
> > example, if a pattern is matching dates, a (Jan|Feb|Mar|Apr|...) subpattern
> > would be more efficient to implement as a DFA than with backtracking.  With 
> > a large amount of data to process, that represent significant savings...
> 
> I agree. I will reply to this in perl6-internals, though.

Yes, the discussion of details belongs there, when it's not infringing on 
issues of language design, as the semantic consequences do...

> > (2) Would simple alternation impede DFA-style optimizations?
> > 
> > Currently, a pattern like (Jun|June) would never match "June" because the 
> > "leftmost" match "Jun" would always take precedence, despite the normal 
> > "longest-match" behavior of regexes in general.  This example could be 
> > implemented in a DFA; would that always be the case?
> 
> You should read Friedl's Mastering Regular Expressions, if you haven't
> already. A POSIX NFA would be required to find the longest match (it
> has to work as if it were a DFA). A "traditional" NFA produces what
> would result from the straightforward backtracking implementation,
> which often gives an answer closer to what the user expects. IMO,
> these are often different, and the DFA would surprise users fairly
> often.

Rather than following a traditional approach to NFA/DFA construction, would 
it be possible to use a modified approach that preserves leftmost matching?
(If so, would it be more expensive or just different?)

> > Would it be better for the matching of (Jun|June) to be "undefined" and 
> > implementation-dependent?  Or is it best to require "leftmost" semantics?
> 
> I'm voting "leftmost", because I've frequently seen people depend on it.

I was going to agree, until I read your next paragraph.

However, it would be useful to be able to say "don't care" to the semantic 
distinction -- it might even be useful to be able to demand longest-match 
take precedence over leftmost matching, but that would incur an extra cost 
in the normal regex engine...

> I'm not so sure that Larry's suggestion of adding a :dfa flag is
> always the right approach, because I think this might actually be
> something you'd want to set for subsets of a grammar or a single
> expression. I don't think it's useful enough to go as far as proposing
> that || mean "alternate without defining the order of preference", but
> perhaps some <angle-bracketed thing> would work. (Or can you embed
> flags in expressions, like perl5's (?imsx:R) thing? Then the :dfa flag
> is of course adequate!)

You know, when you bring up an idea like "||", I start thinking that maybe 
the default should be NOT to have a preference (since it normally doesn't 
matter) and to only guarantee the leftmost short-circuit behavior with "||" 
instead of "|".  That would allow for more implementation flexibility, and 
provide a beautiful parallel with C -- in C, only "||" short-circuits and 
the "|" operator still evaluates all parts.  (Granted, that's because it's 
bitwise, but there's still a nice parallel there.)

For the few cases where someone wants to rely on leftmost matching of any 
alternation, they could simply use "||" instead of "|" for the pattern, 
which wouldn't be much of a burden.  Then "|" could default to "don't care" 
alternation.  This also fits nicely of the principle of making the common 
case (don't care) shorter than the special case (needs leftmost matching).

Of course, I'm sure everyone will be violently opposed to this idea. :-)

Deven

Reply via email to