Re: Perl 5's "non-greedy" matching can be TOO greedy!

Tom Christiansen Thu, 14 Dec 2000 13:12:51 -0800
>Does anyone disagree with the premise, and believe that "bbbbccccd" is the
>CORRECT match for the non-greedy regexp above?

Yes.  The Camel's regex chapter reads:

    You might say that eagerness holds priority over greed (or thrift).

>For what it's worth, here's a quote from a Perl 5.005_03 "perlre" manpage:

>     By default, a quantified subpattern is "greedy", that is, it will
>     match as many times as possible (given a particular starting
>     location) while still allowing the rest of the pattern to match.
>     If you want it to match the minimum number of times possible,
>     follow the quantifier with a "?".  Note that the meanings don't 
>     change, just the "greediness":

>I don't believe that ".*?" matching "bbbcccc" above qualifies as "to match
>the minimum number of times possible", when it is possible only to match
>the "cccc" and still match the full regexp.  Since the documentation makes
>no mention of earliest-match in this paragraph, I can only assume this is
>unintended behavior, but I'm asking to check my assumptions.  Any devil's
>advocates out there who want to argue for the current behavior?

The simple story is this:

    Rule 1: Given two matches at *different* starting points, the
            one that occurs earlier wins.

    *OTHERWISE*

    Rule 2: Given two matches at the *same* starting points, the
            one that is longer wins.

Or, more lengthly:

    Given the opportunity to match something a variable number of
    times, maximal quantifiers will elect to maximize the repeat
    count.  So when we say "as many times as you'd like", the greedy
    quantifier interprets this to mean "as many times as you can
    possibly get away with", constrained only by the requirement
    that this not cause specifications later in the match to fail.
    If a pattern contains two open-ended quantifiers, then obviously
    both cannot consume the entire string: characters used by one
    part of the match are no longer available to a later part.  Each
    quantifier is greedy at the expense of those that follow it,
    reading the pattern left to right.

    That's the traditional behavior of quantifiers in regular
    expressions.  However, Perl permits you to reform the behavior
    of its quantifiers: by placing a C<?> after that quantifier,
    you change it from maximal to minimal.  That doesn't mean that
    a minimal quantifier will always match the smallest number of
    repetitions allowed by its range, any more than a maximal
    quantifier must always match the greatest number allowed in its
    range.  The overall match must still succeed, and the minimal
    match will take as much as it needs to succeed, and no more.
    (Minimal quantifiers value contentment over greed.)

    For example, in the match:

        "exasperate" =~ /e(.*)e/    #  $1 now "xasperat"

    the C<.*> matches "C<xasperat>", the longest possible way for
    it to match.  (It also stores that value in C<$1>, as described
    below under "Capturing and Clustering".)  Although there was a
    shorter match available, a greedy match doesn't care.  Given
    two choices at the same starting point, it always returns the
    I<longer> of the two.

    Contrast this with this:

        "exasperate" =~ /e(.*?)e/   #  $1 now "xasp"

    Here, the minimal matching version, C<.*?>, is used.  Adding
    the C<?> to C<*> makes C<*?> take on the opposite behavior: Now
    given two choices at the same starting point, it always returns
    the I<shorter> of the two.

    Although you could read C<*?> as saying to match zero or more
    of something but preferring zero, that doesn't mean it will
    always match zero characters.  If it did so here, for example,
    and left C<$1> set to C<"">, then the second "C<e>" wouldn't
    be found, since it doesn't immediately follow the first one.

    You might also wonder why, in minimally matching C</e(.*?)e/>,
    Perl didn't stick "C<rat>" into C<$1>.  After all, "C<rat>"
    also falls between two C<e>'s, and is shorter than "C<xasp>".
    In Perl, the minimal/maximal choice applies only when selecting
    the shortest or longest from among several matches that all
    have the same starting point.  If two possible matches exist,
    but these start at different offsets in the string, then their
    lengths don't matter--and neither does whether you've used a
    minimal quantifier or a maximal one.  The earliest of several
    valid matches always wins out over all latecomers.  It's only
    when multiple possible matches start at the same point that you
    use minimal or maximal matching to break the tie.  If the
    starting points differ, there's no tie to break.  Perl's matching
    is normally I<leftmost longest>; with minimal matching, it
    becomes I<leftmost shortest>.  But the "leftmost" part never
    varies, and is the dominant criterion.<footnote> Not all regex
    engines work this way.  Some believe in overall greed, in which
    the longest match always wins, even if it shows up later.  Perl
    isn't that way.  You might say that eagerness holds priority
    over greed (or thrift).  For a more formal discussion of this
    principle and many others, see the section on "The Little Engine
    that /Could(n't)?/".</footnote>

    There are two ways to defeat the leftmost leanings of the pattern
    matcher.  First, you can use an earlier greedy quantifier
    (typically C<.*>) to try to slurp earlier parts of the string.
    In searching for a match for a greedy quantifier, it tries for
    the longest match first, which effectively searches the rest
    of the string right-to-left:

        "exasperate" =~ /.*e(.*?)e/   #  $1 now "rat"

    But be careful with that, since the overall match now includes
    the entire string up to that point.

    The second way to defeat leftmostness is using positional
    assertions, discussed in the next section.


You should probably read the whole chapter.

--tom
Re: Perl 5's "non-greedy" matching can be TOO greedy!

Reply via email to