On Thu, Sep 06, 2007 at 04:02:19PM -0500, Patrick R. Michaud wrote: : I agree. One thought I had was that perhaps non-greedy matching : could also terminate the token prefix.
Well, that's more or less arguing it the other way. It kind of assumes your fooba-ish arguments are smart enough to test for non-r after. : > [...] : > I think longest-token semantics have to trump minimal matching here, : > and my argument is this. Most uses of *? have additional information : > on what terminates it, either implicitly in what it is matching, or : > explicitly in the next bit of regex. That is, you'd typically see : > either : > foo\w+? | fooba : > or : > foo.+? <wb> | fooba : > : > In either case, the clear intent is to match foobar over fooba. : > Therefore I think the DFA matcher just strips ? and does its ordinary : > character by character match, relying on that extra info to match : > the real extent of the quantifier. : : Does this still hold true for a non-greedy quantifier in the : middle of an expression... ? I.e., : : "foobazbar deborah" ~~ /foo .+? b.r | fooba | foobazb / : : matches "foobazbar debor" ? I suppose one approach would be to limit the DFA to the longest known constant string, and then rerun any variable branches individually to see if they will actually match something that long. Alternately, if the DFA engine is smart enough to kick out all the matches of the "foo .+? b.r" branch then it could conceivably be taught to return the shortest one rather than the longest one. Course, the question after than is, what if you mix minimal with maximal quantifiers? Another consideration is the efficiency of matching that pattern against "foobazbar" ~ 'b' x 100000000. That would definitely run faster if you cut off the DFA after the longest known token, or as soon as the DFA reduces to one possibility, presuming you can know such a thing. Any way we work it, there will be a way to use it wrong. Larry