On Wed, 28 Aug 2002, Steve Fink wrote: > On Wed, Aug 28, 2002 at 12:55:44PM -0400, Deven T. Corzine wrote: > > On Wed, 28 Aug 2002, Dan Sugalski wrote: > > > At 10:57 AM -0400 8/28/02, Deven T. Corzine wrote: > > > > On the other hand, :, ::, ::: and <commit> don't necessarily need to be a > > problem if they can be treated as hints that can be ignored. If allowing > > the normal engine to backtrack despite the hints would change the results, > > that might be a problem. I don't know; <cut> may pose special problems. > > They do change the semantics. > > (June|Jun) ebug > > matches the string "Junebug", but > > (June|Jun): ebug > > does not. Similarly, > > (June|Jun) ebug > (June|Jun) :: ebug > (June|Jun) ::: ebug > (June|Jun) <commit> ebug > > all behave differently when embedded in a larger grammar.
Good point. Okay, they definitely change the semantics. Still, could such semantics be implemented in a non-backtracking state machine, whether or not it's a strict DFA? > However, it is very possible that in many (the majority?) of actual > uses, they may be intended purely as optimizations and so any > differing behavior is unintentional. It may be worth having a flag > noting that (maybe combined with a warning "you said this :: isn't > supposed to change what can match, but it does.") I think this is like the leftmost matching semantics -- it may exist for the sake of implementation efficiency, yet it has semantic consequences as well. In many cases, those semantic differences may be immaterial, yet some code relies on it. Allowing flags to specify that such differences in semantics are immaterial to your pattern would be helpful. (Would it make sense for one flag to say "don't care" to the semantic differences for BOTH leftmost matching and the :/::/:::/etc. operators?) > > I believe there are many subpatterns which might be beneficial to compile > > to a DFA (or DFA-like) form, where runtime performance is important. For > > example, if a pattern is matching dates, a (Jan|Feb|Mar|Apr|...) subpattern > > would be more efficient to implement as a DFA than with backtracking. With > > a large amount of data to process, that represent significant savings... > > I agree. I will reply to this in perl6-internals, though. Yes, the discussion of details belongs there, when it's not infringing on issues of language design, as the semantic consequences do... > > (2) Would simple alternation impede DFA-style optimizations? > > > > Currently, a pattern like (Jun|June) would never match "June" because the > > "leftmost" match "Jun" would always take precedence, despite the normal > > "longest-match" behavior of regexes in general. This example could be > > implemented in a DFA; would that always be the case? > > You should read Friedl's Mastering Regular Expressions, if you haven't > already. A POSIX NFA would be required to find the longest match (it > has to work as if it were a DFA). A "traditional" NFA produces what > would result from the straightforward backtracking implementation, > which often gives an answer closer to what the user expects. IMO, > these are often different, and the DFA would surprise users fairly > often. Rather than following a traditional approach to NFA/DFA construction, would it be possible to use a modified approach that preserves leftmost matching? (If so, would it be more expensive or just different?) > > Would it be better for the matching of (Jun|June) to be "undefined" and > > implementation-dependent? Or is it best to require "leftmost" semantics? > > I'm voting "leftmost", because I've frequently seen people depend on it. I was going to agree, until I read your next paragraph. However, it would be useful to be able to say "don't care" to the semantic distinction -- it might even be useful to be able to demand longest-match take precedence over leftmost matching, but that would incur an extra cost in the normal regex engine... > I'm not so sure that Larry's suggestion of adding a :dfa flag is > always the right approach, because I think this might actually be > something you'd want to set for subsets of a grammar or a single > expression. I don't think it's useful enough to go as far as proposing > that || mean "alternate without defining the order of preference", but > perhaps some <angle-bracketed thing> would work. (Or can you embed > flags in expressions, like perl5's (?imsx:R) thing? Then the :dfa flag > is of course adequate!) You know, when you bring up an idea like "||", I start thinking that maybe the default should be NOT to have a preference (since it normally doesn't matter) and to only guarantee the leftmost short-circuit behavior with "||" instead of "|". That would allow for more implementation flexibility, and provide a beautiful parallel with C -- in C, only "||" short-circuits and the "|" operator still evaluates all parts. (Granted, that's because it's bitwise, but there's still a nice parallel there.) For the few cases where someone wants to rely on leftmost matching of any alternation, they could simply use "||" instead of "|" for the pattern, which wouldn't be much of a burden. Then "|" could default to "don't care" alternation. This also fits nicely of the principle of making the common case (don't care) shorter than the special case (needs leftmost matching). Of course, I'm sure everyone will be violently opposed to this idea. :-) Deven