Re: Stack overflow in regexp matcher

Stefan Monnier Sun, 01 Apr 2007 11:50:53 -0700

> The regexp in question looks like: ".*some text.*". When used with
> string-match, removing both ".*"s shouldn't affect the matching
> semantically, right?


Differences (off the top of my head):
- match-beginning (as well as the return value of string-match)
- .* only matches chars other than \n
- .* matches greedily, so (string-match "foo" "foofoo") will only match the
  first "foo", whereas (string-match ".*foo" "foofoo") will match the whole
  string (the .* matches the first "foo").

> If so, would it be wise to add a regexp simplification step that, for
> example, could remove leading and trailing ".*"s in regular expressions
> passed to string-match?

In the code that calls string-match, yes, but within string-match it would
be terribly difficult to figure out whether the optimization is harmless.

> I realise now is not the time to talk about feature additions, but it
> would certainly make my life simpler if any regular expression that *can*
> be compiled to a DFA is.

I don't think you'll find much disagreement here.

> (I also wonder where "regular expressions" that are more than regular
> came from.  Just because they are useful?)

No idea.  Originally, only the \N backref was non-regular, but PCRE
introduced many others.  Note that using a DFA to implement unix
regexp-matching is actually pretty difficult because of the precise
semantics of sub-group matching.  So most "early" implementations used
backtracking, which made non-regular extensions "free".


        Stefan


_______________________________________________
emacs-pretest-bug mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/emacs-pretest-bug

Re: Stack overflow in regexp matcher

Reply via email to