ID:               33334
 User updated by:  kloske at tpg dot com dot au
 Reported By:      kloske at tpg dot com dot au
 Status:           Bogus
 Bug Type:         PCRE related
 Operating System: Linux
 PHP Version:      4.3.10
 New Comment:

Thank you for that information - it is much appreciated. I will take
this up with the PCRE people, as I still believe this to be incorrect
behavior.

FYI, the documentation I was reading was the regex man pages on both
Solaris and Linux. My peers are people who have studied regular
expressions (as have I), and they agreed that, based on the definitions
we have all seen in our respective studies (though none of us has studied
PCRE specifically as an implementation), the behavior we saw was a
violation of the matching conditions specified in the test case's
regular expression.

I.e., based on your "greedy" quote from the PCRE pages: I do not want it
to match a minimum number of times, I want it to match as much as
possible. Note the word possible; this regex did not allow it to match
as much as it did. That is, it became very greedy indeed, to the point
of matching text it wasn't allowed to!


Previous Comments:
------------------------------------------------------------------------

[2005-06-14 17:35:48] [EMAIL PROTECTED]

I have no idea what manuals you are reading or which peers you are
talking to, but in perl-style regular expressions the '?' character is
overloaded and has different meanings in different contexts.  Type "man
perlre" at your Unix prompt and you will see:

       By default, a quantified subpattern is "greedy", that is, it will
       match as many times as possible (given a particular starting
       location) while still allowing the rest of the pattern to match.
       If you want it to match the minimum number of times possible,
       follow the quantifier with a "?".

If you still don't understand this, take it up with the developers of
the PCRE library over at http://pcre.org since that is the code PHP
uses.  Even if somebody here agreed that there is a bug, it would have
to be fixed by the PCRE folks.
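
The perlre passage quoted above can be illustrated with a minimal
sketch. The pattern and the sample input below are hypothetical and are
not the reporter's test case; they only show how a trailing "?" changes
where a quantifier stops.

```php
<?php
// Hypothetical sample input: two quoted fields separated by a comma.
$subject = '"one","two"';

// Greedy: .* runs to the last quote that still lets the pattern match.
preg_match('/"(.*)"/', $subject, $greedy);

// Lazy: .*? stops at the first quote that lets the pattern match.
preg_match('/"(.*?)"/', $subject, $lazy);

echo $greedy[1], "\n"; // one","two
echo $lazy[1], "\n";   // one
```

In both cases the engine starts at the same position; only the amount
of text the quantifier is willing to give back differs.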

------------------------------------------------------------------------

[2005-06-14 16:46:09] [EMAIL PROTECTED]

It really is bogus: PHP uses the PCRE library underneath the preg_*
functions. If there is any bug (IMO there is no bug), then it's in
PCRE, so report this to the authors of that.


------------------------------------------------------------------------

[2005-06-14 12:23:04] kloske at tpg dot com dot au

I do not believe this bug to be bogus or resolved.

------------------------------------------------------------------------

[2005-06-14 12:20:09] kloske at tpg dot com dot au

Hi, strangely enough, you are correct that placing a question mark (for
exactly 0 or 1 matches) works.

*however*, this opens up more questions than it answers (and to my mind
brings to light perhaps deeper bugs). The regex manuals all have the
following to say:

1. The behavior of multiple adjacent duplication symbols (+, *, ? and
intervals) produces undefined results.

2. * matches zero or more occurrences, so ignoring (1), *? read in the
most obvious way means "zero or more, repeated once or not at all",
which logically collapses down to "zero or more". That is what * means
on its own, which is (a) what I had, and (b) logically equivalent to
the suggested solution.

3. '/' and '"' NEVER (even when greedy) match ([^\"]|\\|\"), which my
test case clearly demonstrates the PHP regular expression engine
doing.

(1) would tend to suggest that *?, offered as the correct way to achieve
what I am after, is undefined and therefore not correct.

(2) seems to indicate that, failing (1), the two expressions should be
equivalent and both produce the same behavior (which they clearly do
not).

and

(3) cannot possibly be explained by ANY alternative solution since it
clearly violates all possible ways of interpreting the regex.

Put simply: any sequence of characters generated from this regular
expression ([^\"]|\\|\") can never contain a single backslash or a
quote that is not preceded by a backslash, yet the match that PHP's
regular expression engine is returning violates this precondition.

I can see three possible situations occurring here:

1. PHP regex differs from the standard forms of regex available on
POSIX systems, and whilst this may be desirable it needs to be clearly
documented (which it currently is not - it is not even hinted at).

2. PHP regex has a bug with its handling of zero or more repetition
generators.

3. There is something which I still am missing after repeated
inspections, readings of the relevant manuals, and consultation with
peers.
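
Claim (3) can be checked directly. The sketch below assumes the group
is handed to PCRE exactly as typed in this comment; the ^ and $ anchors
and the sample subjects are additions for illustration only. In PCRE
syntax \" is simply an escaped quote, so [^\"] excludes only the quote
character, and the third branch matches a bare quote.

```php
<?php
// The group as typed in the comment, anchored so that it must match
// exactly one whole subject. Single-quoted PHP string: \\\\ becomes \\.
$alt = '/^([^\"]|\\\\|\")$/';

$lone_backslash = preg_match($alt, '\\'); // subject is one backslash
$bare_quote     = preg_match($alt, '"');  // subject is one quote

echo $lone_backslash, "\n"; // 1: [^\"] accepts a backslash
echo $bare_quote, "\n";     // 1: the \" branch accepts a bare quote
```

Under that assumption each branch consumes a single character and
together the branches accept any character, which would account for a
greedy repetition of the group running up to the last quote on the
line.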

------------------------------------------------------------------------

[2005-06-14 09:52:54] [EMAIL PROTECTED]

Regular expressions are greedy by default.  Change it to:

$r_text = "(\"(([^\\\"]|\\\\|\\\")*?)\"|[^\",][^,]*?)";

or use the U modifier on the call and I bet it will do what you want. 
There is no bug here.  
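
A minimal sketch of the two options suggested above. The input line is
hypothetical, and the greedy pattern is reconstructed by dropping both
"?" from the suggested one, which is an assumption because the original
test case is not shown in this excerpt.

```php
<?php
// Suggested pattern from the comment above (lazy quantifiers).
$lazy = "(\"(([^\\\"]|\\\\|\\\")*?)\"|[^\",][^,]*?)";
// Assumption: the same pattern with both "?" removed (greedy).
$greedy = "(\"(([^\\\"]|\\\\|\\\")*)\"|[^\",][^,]*)";

// Hypothetical input: two quoted fields on one line.
$line = '"one","two"';

preg_match('/' . $lazy . '/', $line, $a);    // explicit *? in the pattern
preg_match('/' . $greedy . '/U', $line, $b); // U modifier inverts greediness
preg_match('/' . $greedy . '/', $line, $c);  // plain greedy, for contrast

echo $a[0], "\n"; // "one"
echo $b[0], "\n"; // "one"
echo $c[0], "\n"; // "one","two"
```

The U modifier affects every quantifier in the pattern, so writing *?
only where it is needed is the more targeted of the two options.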

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/33334
