On Wed, 28 Aug 2002, Larry Wall wrote:

> That is a worthy consideration, but expressiveness takes precedence
> over it in this case.

I see nothing wrong with expressiveness taking precedence -- I'm only 
saying that it would be best to be cognizant of any restrictions we're 
hardcoding into the design (in case there's a less restrictive means to the 
same ends) and make that design tradeoff knowingly rather than by default.

If we can find general solutions that don't demand a particular style of 
implementation, that's probably an improvement.  There may be unavoidable 
cases; there, we simply accept the limitation in exchange for expressiveness, 
and that's a perfectly reasonable design choice.

I'd just hate to ignore the issue now, and have someone later say "here's a 
great way it could have been done that would have allowed this improvement 
in the implementation"...

> DFAs are really only good for telling you *whether* and *where* a pattern 
> matches as a whole.  They are relatively useless for telling you *how* a 
> pattern matches.  For instance, a DFA can tell you that you have a valid 
> computer program, but can't hand you back the syntax tree, because it has
> no way to decide between shifting and reducing.  It has to do both
> simultaneously.

Yes and no.  You're right, but see below.

> : It may be that backreferences already demand backtracking.  Or some other 
> : feature might.  I don't know; I haven't thought it through.
> 
> I believe you are correct that backrefs require backtracking.  Maybe some 
> smart person will find a way to trace back through the states by which a 
> DFA matched to retrieve backref info, but that's probably worth a PhD or 
> two.

Well, there are certainly PhD students out there doing new research all the 
time.  Who knows what one will come up with one day?  It would suck if one 
gets a PhD for a super-clever pattern-matching algorithm, and we find that 
we can't use it because of hardcoded assumptions in the language design...

As for backtracing states of the DFA, see below.

> Minimal matching is also difficult in a DFA, I suspect.

Is it?  I'm not sure.  Since the DFA effectively follows all branches of 
the NFA at once, perhaps minimal matching is no more difficult than maximal?

Then again, maybe not. :-)
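To make that hunch concrete, here's a toy sketch (in Python, with an NFA 
encoding invented just for this example) of the follow-all-branches-at-once 
simulation.  Because every live state advances in lockstep, the earliest 
position where an accepting state turns up is the minimal match and the 
latest is the maximal one -- both fall out of the same single pass:

```python
def eps_closure(states, nfa):
    # Follow epsilon (None-labelled) transitions until no new states appear.
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in nfa.get(s, []):
            if label is None and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def simulate(nfa, start, accept, text):
    """Advance every live NFA state in lockstep over `text`; return every
    end position at which an accepting state was live."""
    current = eps_closure({start}, nfa)
    ends = [0] if current & accept else []
    for i, ch in enumerate(text):
        nxt = {t for s in current for label, t in nfa.get(s, []) if label == ch}
        current = eps_closure(nxt, nfa)
        if not current:
            break
        if current & accept:
            ends.append(i + 1)
    return ends

# Toy NFA for the pattern "aa*": states are ints, transitions map
# state -> list of (label, next_state).  State 1 is accepting.
nfa = {0: [('a', 1)], 1: [('a', 1)]}
ends = simulate(nfa, start=0, accept={1}, text="aaab")
# min(ends) is the minimal match length, max(ends) the maximal --
# no extra work was needed to get one versus the other.
```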

> : If we must have backtracking, so be it.  But if that's a tradeoff we're 
> : making for more expressive and powerful patterns, we should still at least 
> : make that tradeoff with our eyes open.  And if the tradeoff can be avoided, 
> : that's even better.
> 
> I refer you to http://history.org/grandmother/teach/egg/suck.  :-)

Um, huh?

> That's a tradeoff I knowingly made in Perl 1.  I saw that awk had a
> DFA implementation, and suffered for it as a useful tool.

I suspect it's not practical to have an all-DFA implementation with nearly 
the power and expressiveness of Perl 4 or Perl 5 regexes, much less Perl 6.

On the other hand, many patterns have subpatterns which might benefit from 
using a DFA as an optimization.  You don't lose the expressiveness here,
if the backtracking NFA is available as well.
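As one concrete possibility (a hypothetical sketch, not any existing 
engine's code), an alternation of fixed strings can be compiled into a trie 
and matched with a single forward scan -- no backtracking at all.  A hybrid 
engine could use something like this for the literal subpatterns and fall 
back to the NFA for everything else:

```python
def build_trie(words):
    # Each node is a dict of char -> child node; '$' marks "a word ends here".
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def match_prefix(trie, text, pos=0):
    """Return the longest alternative matching at `pos`, or None.
    A single forward scan: no character is ever revisited."""
    node, best = trie, None
    for i in range(pos, len(text)):
        if '$' in node:
            best = text[pos:i]
        node = node.get(text[i])
        if node is None:
            return best
    return text[pos:] if '$' in node else best

# Longest-match over the fixed alternatives "foo" and "foobar":
winner = match_prefix(build_trie(["foo", "foobar"]), "foobaz")
```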

I'm just asking that we consider the impact on such optimizations, and see 
if we can leave the door open to reap the benefits without compromising the 
power and expressiveness we all want.  Maybe this just amounts to adding a 
few modifiers to allow semantic variants (like longest-trumps-leftmost), to 
enable optimizations that would otherwise impinge on correctness...
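To show what such a semantic variant would change: Python's re engine, 
which like Perl's uses leftmost-first alternation, picks the first branch 
that can match, while a longest-match (POSIX/DFA-style) rule would pick the 
branch that consumes the most text.  The longest-match side is computed by 
brute force here purely for contrast:

```python
import re

# Perl-style (leftmost-first) alternation: the first alternative that can
# match wins, even though a later one would match more text.
first = re.match(r'a|ab', 'ab').group()

# Longest-match semantics for the same alternation, brute-forced
# over the two alternatives just to show the difference:
longest = max((alt for alt in ('a', 'ab') if 'ab'.startswith(alt)), key=len)
```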

> And it's not just the backrefs.  An NFA can easily incorporate
> strategies such as Boyer-Moore, which actually skips looking at many of
> the characters, or the "scream" algorithm used by study, which can skip
> even more.  All the DFAs I've seen have to look at every character,
> albeit only once.  I suppose it's theoretically possible to compile
> to a Boyer-Moore-ish state machine, but I've never seen it done.

Okay, I confess that I've been saying "DFA" when I don't necessarily mean 
precisely that.  What I really mean is a "non-backtracking state machine" 
of some sort, but I'm calling it a DFA because it would be similar to one 
(to the degree possible) and people know what a DFA is.  I could say NBSM, 
but that seems confusing. :-)

Your objections to the limitations of a DFA are quite correct, of course.  
Modifications would be required to overcome the limits, and then it's no 
longer really a DFA, just like Perl 5's "regular expressions" are no longer 
really "regular expressions" in the mathematical sense.  I'm envisioning a 
state machine of some sort, which has a lot in common with a DFA but isn't 
strictly a DFA anymore.  If you prefer, I'll call it an NBSM, or I'm open 
to better suggestions for a name!

Anyway, to respond to your objections to a DFA:

* While you couldn't hand back a syntax tree from a true DFA, it should be
  possible to create an "NBSM" from a DFA recognizer, modified to record 
  whatever extra information is needed to execute the code that constructs 
  the syntax tree.  The NFA can't execute the code until it's certain which 
  match will succeed anyhow, and the DFA would have equivalent states where
  the code could be triggered.  Any needed information that was available 
  at earlier states would need to be saved along the way.

* As for backreferences, they might not be as bad as they look.  Certainly,
  you don't need a magical way to backtrace the previous states of the DFA;
  if necessary, each state could be logged so the history is available.
  Again, this wouldn't properly be a DFA anymore, but it would be an NBSM.

* Backreferences may be the same problem as capturing the matched text from 
  a parenthesized subpattern.  While a strict DFA couldn't remember where 
  the match started, an NBSM should be able to, by recording the starting
  position(s) being matched.  Once you've got the ability to capture the 
  matching text in the subpattern, perhaps you can dynamically match that 
  text while still in the NBSM?

* If minimal matching isn't as straightforward with a strict DFA, it should 
  be possible to turn that DFA into an NBSM that remembers the points in
  the string where smaller matches were found, so the minimal one can be 
  chosen where possible.  At least, I think this could work. :-)

* As for strategies such as Boyer-Moore, what inherent advantage does an 
  NFA have here?  If you add Boyer-Moore, it's not strictly an NFA anymore, 
  and the same goes for a DFA.  Is there some reason why certain states in 
  an NBSM couldn't switch algorithms to match the next block of fixed text?
  I don't see any reason why such optimizations couldn't also be applied to 
  an NBSM, just as they are to the NFA implementation.
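To illustrate what such a fixed-text state might do, here's the Horspool 
simplification of Boyer-Moore, sketched in Python: the needle is compared 
right to left, and a bad-character table lets the scan skip whole windows 
of characters without looking at them -- exactly the skipping behaviour 
that could be grafted onto an NBSM state for a literal block:

```python
def horspool_find(needle, haystack):
    """Boyer-Moore-Horspool search: return the index of the first
    occurrence of `needle`, or -1.  Skips ahead using the character
    under the right end of the current window."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    if m > n:
        return -1
    # How far the window may shift when it ends over character c.
    skip = {needle[i]: m - 1 - i for i in range(m - 1)}
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Characters between here and the next window are never examined.
        pos += skip.get(haystack[pos + m - 1], m)
    return -1
```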

I'm sure that some of the problems with DFAs can't be solved, or are just 
too expensive (probably in memory) to consider.  Still, many of the issues 
may become more tractable if you don't confine yourself to the very strict 
mathematical definition of a DFA.

I call it a DFA to get the basic idea across, but I really mean a state 
machine which is similar to a DFA, but which maintains state information 
and perhaps incorporates other strategies (like Boyer-Moore) rather than 
the traditional, limited form of DFA that theoreticians would prefer...
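As a concrete (if toy) illustration of that "DFA plus a log" idea, here's a 
table-driven DFA run that records the state visited at every position.  The 
recorded trail is precisely the information a strict DFA throws away, and 
it's what backtracing or capture recovery would read afterwards; the DFA 
encoding is invented for this example:

```python
def run_dfa_with_trace(dfa, start, accept, text):
    """Run a table-driven DFA over `text`, logging (position, state) pairs.
    A strict DFA keeps only the current state; keeping the whole trail is
    what turns this into the 'NBSM' discussed above."""
    state, trace = start, [(0, start)]
    for i, ch in enumerate(text):
        state = dfa.get((state, ch))
        if state is None:
            return False, trace
        trace.append((i + 1, state))
    return state in accept, trace

# DFA for "ab*c": 0 -a-> 1, 1 -b-> 1, 1 -c-> 2 (accepting).
dfa = {(0, 'a'): 1, (1, 'b'): 1, (1, 'c'): 2}
ok, trace = run_dfa_with_trace(dfa, 0, {2}, "abbc")
# The trace shows every position where the machine sat in state 1,
# i.e. exactly where the "b*" subpattern was being consumed.
```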

> Add to that the fact that most real-life patterns don't generally do
> much backtracking, because they're written to succeed, not to fail.
> This pattern never backtracks, for instance:
> 
>     my ($num) = /^Items: (\d+)/;

True, but many patterns use alternation, which leads to backtracking...
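For instance (a deliberately naive matcher, invented here just to show the 
mechanism), an alternation followed by more pattern forces the engine to 
commit to one branch and rewind to the choice point when the rest fails:

```python
def match_alt(branches, rest, text):
    """Try each branch of an alternation in order at the start of `text`;
    `rest` (the literal pattern after the alternation) must match right
    after the branch.  When it doesn't, we rewind to the choice point and
    try the next branch -- that rewind is the backtracking."""
    attempts = []
    for branch in branches:
        attempts.append(branch)
        if text.startswith(branch) and text[len(branch):].startswith(rest):
            return True, attempts
    return False, attempts

# Matching "(foo|foobar)baz" against "foobarbaz": the engine first commits
# to "foo", fails on the trailing "baz", backtracks, then succeeds.
ok, attempts = match_alt(["foo", "foobar"], "baz", "foobarbaz")
```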

> I'm not against applying a DFA implementation where it's useful
> and practical, but just because it's the "best" in some limited
> theoretical framework doesn't cut it for me.  Humans do a great
> deal of backtracking in real life, once the limits of their parallel
> pattern matching circuits are exceeded.  Even in language we often
> have to reparse sentences that are garden pathological.  Why should
> computers be exempt? :-)

I doubt we can be completely exempt without crippling the functionality, 
power and expressiveness we all want.

Ultimately, I think a hybrid solution might be ideal.  In some cases, an 
NFA is the best solution; it's quicker and easier.  In others, we might be 
able to locally optimize subpatterns into DFA-like NBSMs for extra speed, 
with little cost.  In a few cases, the high-speed NBSM might be very costly 
but worthwhile for the application in question -- and this is where some 
sense of the programmer's priorities could be helpful to decide whether to 
just use the NFA or optimize more aggressively...

Deven
