ID:               33334
 User updated by:  kloske at tpg dot com dot au
 Reported By:      kloske at tpg dot com dot au
 Status:           Bogus
 Bug Type:         PCRE related
 Operating System: Linux
 PHP Version:      4.3.10
 New Comment:

Look, I don't really care anymore one way or another, because I've now
figured out how it all works in enough detail to write useful, stable
and correct code. But just for interest's sake, my regex used double
quotes because:

1. I needed other escapes and variables in there which single quotes
will not allow, and the alternative was using lots of dot notation
which looked uglier than using double quotes.

2. The documentation of which you speak, where this is apparently
documented, http://au.php.net/preg_match, examples 1-3 (the only
numbered examples on this documentation page) all use double quotes. As
an aside, all three examples are wrong, or at best highly misleading,
since they use \b, which inside a double quote looks like an escape that
PHP would process before it ever gets to the PCRE code. I ran some tests
today, and inside a double quote, it's much more correct to use \\b
instead of \b. Whilst \b will work, since it reaches PCRE unchanged, a
lone \\ won't, because PCRE sees the resulting single \ and assumes
we're trying to escape something.

3. I really really wanted to. Single or double quotes, regex is regex.
I am sorry if I violated your preference. I should point out that the
regex is now 860+ characters long, so it ain't going to be easy to read
in single or double quotes. I merely compressed it down and stuck with
the format I was using.

In spite of all this, I couldn't find anywhere in the PHP docs where
they specifically mention the stuff about backslash escaping, and as I
mentioned in point 2 above, far from it: their examples in fact
mislead.
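
For what it's worth, here is a minimal sketch of the behaviour described
in point 2. The variable names and the sample subject strings are made
up for illustration, and the comments reflect my understanding of how
double-quoted strings are handled rather than anything stated in the
manual page above.

```php
<?php
// Double-quoted strings only rewrite the escapes PHP itself knows
// (\n, \t, \\, \", \$ and so on). \b is not one of them, so the
// backslash survives and both spellings give the same pattern text.
$unescaped = "/\bweb\b/i";
$escaped   = "/\\bweb\\b/i";
var_dump($unescaped === $escaped);

// Either way PCRE receives \b and treats it as a word boundary.
$hit = preg_match($unescaped, 'PHP is the web scripting language of choice.');
var_dump($hit);

// A backslash that is meant literally is different: "\\" collapses to
// a single \ in the string, PCRE reads \/ as an escaped delimiter, and
// the pattern is rejected. The @ only hides the warning.
$broken  = @preg_match("/\\/", 'a\\b');
// Four backslashes in the source leave two for PCRE, i.e. one literal.
$working = preg_match("/\\\\/", 'a\\b');
var_dump($broken, $working);
```

So \\b is the safer habit in double quotes, but the case that really
bites is the literal backslash.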


Previous Comments:
------------------------------------------------------------------------

[2005-06-15 18:58:13] [EMAIL PROTECTED]

It would be a hell of a lot easier to read your regexes if you would
use single quotes.  eg.

$r = '/^"([^\\"]|\\\\|\\")*"$/';
$s = '"some text","test \\"';
preg_match($r, $s, $m);
var_dump($m);

for your above example.  And this stuff is documented.
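
As a quick sanity check (a sketch only, with hypothetical variable
names), the single-quoted pattern above and the double-quoted one from
the earlier comment appear to decode to exactly the same text, so the
choice of quotes changes readability and not what PCRE is given:

```php
<?php
// The single-quoted form suggested above.
$single = '/^"([^\\"]|\\\\|\\")*"$/';
// The double-quoted form from the original report.
$double = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";

// Both literals decode to the same string.
var_dump($single === $double);

// This is the pattern text that is actually handed to PCRE.
echo $single, "\n";   // /^"([^\"]|\\|\")*"$/
```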



------------------------------------------------------------------------

[2005-06-15 12:03:40] kloske at tpg dot com dot au

Okay, the PCRE people have gotten back to me, and PCRE has proven to
produce the correct expected behavior and my test case has not failed.

So now we're left with a test case which fails in PHP yet works on
PCRE.

For a more stark example, consider the following PHP code:

$r = "/^\"([^\\\"]|\\\\|\\\")*\"\$/";
$s = "\"some text\",\"test \\\"";
preg_match($r, $s, $m);
var_dump($m);

$m should be empty, since $s does not match $r, yet the following is
returned:

array(2) { [0]=> string(20) ""some text","test \"" [1]=> string(1) "\"
} 

Note that the last element of the array contains a single backslash,
indicating that the last choice that matched was a backslash, which is
NOT ONE OF THE THREE CHOICES.

So, the PCRE people explained that they were not familiar with PHP but
wondered if it is an escaping issue.

Does PHP require you to DOUBLE escape regex? ie, to match a sequence of
two backslashes in a row, do you need to write "\\\\\\\\"? I've tried
doing this and it seems to give the expected behavior, yet the manual
does not mention this fact, and worse the user comments seem to
indicate that you should not double escape (since no one is trying to
do two backslashes in a row anywhere).

I'd say this is a documentation deficiency more than anything, since
it should be made clear that you need to escape the string first, which
then will need to be escaped again for correct interpretation by PCRE
if you are trying to include a literal backslash, or in other
situations where PCRE needs to escape things.

To recap, this is what you apparently need to write in PHP to match a 
literal of two backslashes next to each other:

"\\\\\\\\"

Gotta love it!

Because:

The number of backslashes is halved when PHP encodes it as a string,
then it passes it literally to PCRE, which halves the number of
backslashes again, to the final figure of two backslashes!

Simple when you understand, not even hinted at in the PHP
documentation.
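
The two rounds of halving can be checked with a short sketch. The
variable names are hypothetical, and the "strict" pattern is only my
guess at what the original regex was meant to say (no bare backslash or
quote inside the quoted field), not something taken from the report:

```php
<?php
// Eight backslashes in the source: PHP keeps four in the string.
$body = "\\\\\\\\";
var_dump(strlen($body));

// PCRE halves them again, so the pattern matches two literal backslashes.
$twoBackslashes = "\\\\";
$matched = preg_match('/^' . $body . '$/', $twoBackslashes);
var_dump($matched);

// Assumed intent of the original regex, with every backslash that PCRE
// should see written twice in the single-quoted literal.
$strict = '/^"([^\\\\"]|\\\\\\\\|\\\\")*"$/';

$bad  = '"some text","test \\"';   // the subject from the report
$good = '"say \\"hi\\""';          // one field with escaped quotes

$badResult  = preg_match($strict, $bad);
$goodResult = preg_match($strict, $good);
var_dump($badResult, $goodResult);
```

With the doubled escapes the reported subject no longer matches, while
a single well-formed quoted field still does.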

------------------------------------------------------------------------

[2005-06-15 11:22:32] kloske at tpg dot com dot au

As a simpler test case, this literal text string:

"test","string\"

matches the following REGEX pattern:

^"([^\"]|\\|\")*"$

Reversing the sense of REGEX to being a pattern GENERATOR, there is no
way for that REGEX pattern to generate the string above.

I've reported this to the PCRE people and will keep you all posted as
to the reply.
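
One possible reading, sketched here on the assumption that the pattern
reaches PCRE exactly as written above: inside a character class, \" is
just an escaped double quote, so [^\"] excludes only the quote and is
free to match a backslash. The variable names are illustrative.

```php
<?php
// Single quotes leave \" alone, so PCRE receives the class [^\"].
$class = '/^[^\"]$/';

// A subject consisting of one literal backslash.
$backslash = '\\';

// The class accepts the backslash, because only the quote is excluded.
$classResult = preg_match($class, $backslash);
var_dump($classResult);

// The full pattern from this comment therefore matches the test string.
$pattern = '/^"([^\"]|\\\\|\")*"$/';
$subject = '"test","string\\"';
$fullResult = preg_match($pattern, $subject, $m);
var_dump($fullResult, $m[1]);
```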

------------------------------------------------------------------------

[2005-06-15 01:18:47] kloske at tpg dot com dot au

Thank you for that information - it is much appreciated. I will take
this up with the PCRE people, as I still believe this to be incorrect
behavior.

FYI, the documentation I was reading was the regex man pages on both
solaris and linux. My peers were people who've studied regular
expressions (as have I), and agreed that based on the definitions we've
all seen in our respective studies (though none of us have studied PCRE
specifically as an implementation) that the behavior we saw was a
violation of matching conditions, as specified in the test case's
regular expression.

ie: based on your greedy quote from the PCRE pages, I do not want it to
match a minimum number of times, I want it to match as much as possible.
Note the word possible; this regex did not allow it to match as much as
it did - IE, it became very greedy indeed, to the point of matching
text it wasn't allowed to!

------------------------------------------------------------------------

[2005-06-14 17:35:48] [EMAIL PROTECTED]

I have no idea what manuals you are reading or which peers you are
talking to, but in perl-style regular expressions the '?' character is
overloaded and has different meanings in different contexts.  Type "man
perlre" at your Unix prompt and you will see:

       By default, a quantified subpattern is "greedy", that is, it will
       match as many times as possible (given a particular starting
       location) while still allowing the rest of the pattern to match.
       If you want it to match the minimum number of times possible,
       follow the quantifier with a "?".

If you still don't understand this, take it up with the developers of
the PCRE library over at http://pcre.org since that is the code PHP
uses.  Even if somebody here agreed that there is a bug, it would have
to be fixed by the PCRE folks.
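
A small illustration of the quoted paragraph, using a made-up subject
string and hypothetical variable names:

```php
<?php
$subject = '"one","two"';

// Greedy: .* takes as much as it can while the final quote still matches.
preg_match('/".*"/', $subject, $greedy);

// Lazy: the trailing ? makes .* stop at the first closing quote.
preg_match('/".*?"/', $subject, $lazy);

var_dump($greedy[0]);   // "one","two"
var_dump($lazy[0]);     // "one"
```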

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/33334
