ID: 14893
User updated by: [EMAIL PROTECTED]
Reported By: [EMAIL PROTECTED]
Status: Bogus
Bug Type: PCRE related
Operating System: SunOS
PHP Version: 4.1.1
New Comment:

The bug is in PCRE, as the category states. I am merely bringing it to
the attention of the PHP developers, since PCRE is the regex engine
that PHP employs.  I have contacted the author of PCRE, and he'll fix
it in the next release of PCRE.

As for why it is properly a bug:

  "ab1b" =~ /(.*)\d+\1/

should match as follows (assuming absolutely no optimizations are
done):

  []     [ab1b]  OPEN 1
  []     [ab1b]  STAR ANY
  [ab1b] []      CLOSE 1
  [ab1b] []      PLUS DIGIT
                   fail
  [ab1]  [b]     CLOSE 1
  [ab1]  [b]     PLUS DIGIT
                   fail
  [ab]   [1b]    CLOSE 1
  [ab]   [1b]    PLUS DIGIT
  [ab1]  [b]     REF 1
                   fail
  [a]    [b1b]   CLOSE 1
  [a]    [b1b]   PLUS DIGIT
                   fail
  []     [ab1b]  CLOSE 1
  []     [ab1b]  PLUS DIGIT
                   fail
                   start over
  [a]    [b1b]   OPEN 1
  [a]    [b1b]   STAR ANY
  [ab1b] []      CLOSE 1
  [ab1b] []      PLUS DIGIT
                   fail
  [ab1]  [b]     CLOSE 1
  [ab1]  [b]     PLUS DIGIT
                   fail
  [ab]   [1b]    CLOSE 1
  [ab]   [1b]    PLUS DIGIT
  [ab1]  [b]     REF 1
  [ab1b] []      DONE
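
For anyone who wants to replay the trace above, here is a minimal sketch
in PHP of a hand-rolled backtracking matcher for this one pattern.  It
is only an illustration and not how PCRE is implemented; the function
name and the $anchorAtStart flag are made up for this example.  Passing
true for the flag emulates an engine that never "starts over" at a later
offset.

```php
<?php
// Hypothetical helper: backtracking matcher for the single pattern
// /(.*)\d+\1/ only. Assumes PHP 7.1 or later.
function matchDotStarDigitsBackref(string $subject, bool $anchorAtStart = false): ?array
{
    $len = strlen($subject);
    // With the implicit anchor, only start offset 0 is ever tried.
    $lastStart = $anchorAtStart ? 0 : $len;

    for ($start = 0; $start <= $lastStart; $start++) {
        // (.*) is greedy: try the longest capture first, then give back
        // one character at a time, down to the empty capture.
        for ($capLen = $len - $start; $capLen >= 0; $capLen--) {
            $capture = substr($subject, $start, $capLen);
            $pos = $start + $capLen;

            // \d+ : count the run of digits that begins at $pos.
            $digits = strspn($subject, '0123456789', $pos);

            // \d+ may give digits back too, but must keep at least one.
            for ($d = $digits; $d >= 1; $d--) {
                $refPos = $pos + $d;
                // \1 : the captured text must reappear right here.
                if (substr($subject, $refPos, $capLen) === $capture) {
                    return [
                        'start'   => $start,
                        'capture' => $capture,
                        'match'   => substr($subject, $start, $refPos + $capLen - $start),
                    ];
                }
            }
        }
    }
    return null; // no start offset produced a match
}

// Unanchored: succeeds at start offset 1 with capture "b", as in the trace.
var_dump(matchDotStarDigitsBackref('ab1b'));
// Anchored to offset 0: no match.
var_dump(matchDotStarDigitsBackref('ab1b', true));
```

The outer loop is the "start over" step of the trace; taking it away is
the whole difference between the two calls.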

You can see that this regex should succeed (at least, I hope I've made
that clear).  The problem is that the PCRE engine optimizes a .* at the
beginning of a regex to be implicitly anchored with ^, since it seems
obvious that if .* is going to match anywhere, it will end up matching
at the beginning of the string.  This is perfectly sensible except in
the case where that .* is captured and used later in the regex, as my
case shows.
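
The effect of the implicit anchor can be reproduced on any engine by
writing the ^ out by hand.  The sketch below assumes a PHP build whose
PCRE no longer applies the optimization to a captured, back-referenced
.* (variable names are only illustrative):

```php
<?php
$subject = 'ab1b';

// Explicit anchor: only offset 0 is tried, so (.*) can capture "ab1b",
// "ab1", "ab", "a" or "", and none of those is followed by digits and
// then a copy of itself.
$anchored = preg_match('/^(.*)\d+\1/', $subject);

// No anchor: a correct engine also tries offset 1, where (.*) captures
// "b", \d+ matches "1", and \1 matches the final "b".
$unanchored = preg_match('/(.*)\d+\1/', $subject);

var_dump($anchored);   // int(0)
var_dump($unanchored); // int(1) on an engine without the bug
```

On an affected engine both calls would return 0, because the second
pattern is silently treated like the first.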


Previous Comments:
------------------------------------------------------------------------

[2002-01-27 01:03:39] [EMAIL PROTECTED]

a) Not a PHP bug (but it's good to be aware of this issue; if you
wouldn't mind, please send a mail to [EMAIL PROTECTED] with any updates,
etc.)

b) not sure if this is really a bug, the way I read the 1st regex is:

read in ab
put that as \1
after a digit match \1
which is ab
after the digit there is only b

whereas in the second example you make the regex non-greedy, so
therefore it matches from the beginning of the string and matches the
ab from the lookahead assertion.

I could be wrong, but either way it's not a PHP bug ;)  If you disagree,
please follow up at [EMAIL PROTECTED]

regards,
sterling

------------------------------------------------------------------------

[2002-01-06 17:57:14] [EMAIL PROTECTED]

Here's the problem:

<?php echo preg_match('/(.*)\d+\1/', 'ab1b'); ?>

It fails, but it really shouldn't.  You can fool the engine into not
having the bug:

<?php echo preg_match('/(?=)(.*)\d+\1/', 'ab1b'); ?>

The bug is thus:  a regex that starts with .* can logically be made to
start with an implicit anchor to the beginning of the string.  However,
this optimization can break the success of a regex if the .* is
captured (as above) and used later (the back-reference \1).  I've
contacted the author of the PCRE package.
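
To check whether a given PHP build is affected, the two calls above can
be compared directly.  This is only a rough sketch: the function name is
made up for illustration, and it assumes the (?=) workaround still
defeats the optimization on the engine being tested.

```php
<?php
// Hypothetical helper; both patterns are copied from the report above.
function hasDotStarAnchorBug(): bool
{
    $plain      = preg_match('/(.*)\d+\1/', 'ab1b');
    $workaround = preg_match('/(?=)(.*)\d+\1/', 'ab1b');

    // The bug is present when only the workaround pattern matches.
    return $plain === 0 && $workaround === 1;
}

echo hasDotStarAnchorBug() ? "affected\n" : "not affected\n";
echo 'PCRE version: ', PCRE_VERSION, "\n"; // constant provided by PHP's pcre extension
```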

------------------------------------------------------------------------





-- 
PHP Development Mailing List <http://www.php.net/>
