ID:               33334
 User updated by:  kloske at tpg dot com dot au
 Reported By:      kloske at tpg dot com dot au
 Status:           Open
 Bug Type:         Regexps related
 Operating System: Linux
 PHP Version:      4.3.10
 New Comment:

Note that due to issues with the CAPTCHA, I've somehow included the
wrong expected output and actual output.

The ACTUAL output is:
"some text","test \",thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
  [0]=>
  string(27) ""some text","test \",thing""
  [1]=>
  string(20) ""some text","test \""
  [2]=>
  string(18) "some text","test \"
  [3]=>
  string(1) "\"
  [4]=>
  string(6) "thing""
}

And the expected output is:
"some text","test \",thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
  [0]=>
  string(27) ""some text","test \",thing""
  [1]=>
  string(20) ""some text""
  [2]=>
  string(18) "some text"
  [3]=>
  string(1) "t"
  [4]=>
  string(6) ""test \", thing""
  [5]=>
  string(6) "test \", thing"
  [6]=>
  string(1) "g"
}

Sorry for the confusion.


Previous Comments:
------------------------------------------------------------------------

[2005-06-14 09:14:29] kloske at tpg dot com dot au

Description:
------------
Whilst trying to get a > 600 character regular expression to correctly
match input lines from a file, I discovered some strange mismatching,
which at first I took for a bug in my regex string until I reduced
it to the simple test case included below.

The test case shows a regex which should match lines that contain
two fields, separated by a comma. Each field is identical and can
either be a string that does not start with a quote or a comma and
contains no commas, OR a string that starts and ends with a quote
and in which any quotes or backslashes are escaped with a preceding
backslash. I.e.: two fields, each either a simple string or a
C-style escaped string, separated by a comma.
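
For reference, the field grammar described above can be sketched as
follows. This is only an illustration, not the pattern from the
reproduce code: the names $field and $twoFields are hypothetical, the
pattern is written with single-quoted PHP strings to keep the escaping
levels down, and it assumes the only permitted escapes are \" and \\.

```php
<?php
// Illustrative sketch; $field and $twoFields are hypothetical names.
// Quoted field:   a quote, then any run of characters that are neither a
//                 quote nor a backslash, or the escapes \" and \\, then a
//                 closing quote.
// Unquoted field: does not start with a quote or a comma and contains
//                 no commas.
// As seen by PCRE: ("((?:[^"\\]|\\["\\])*)"|[^",][^,]*)
$field = '("((?:[^"\\\\]|\\\\["\\\\])*)"|[^",][^,]*)';
$twoFields = '/^' . $field . ',' . $field . '$/';

$s = '"some text","test \",thing"';
$m = array();
if (preg_match($twoFields, $s, $m)) {
    // $m[1] and $m[3] hold the whole fields; $m[2] and $m[4] hold the
    // contents of a quoted field without the surrounding quotes.
    var_dump($m);
}
```

With the sample line, this sketch splits the input at the first comma
outside the quotes, so the second field keeps its escaped quote and the
comma that follows it.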

Note that in my expected output I am making an educated guess as to
what the correct output would be; some of the other fields printed
might be a little different. The basics of the problem, however, are
clearly demonstrated.

The final thing to note is that if I exclude quotes from the middle or
end of the unquoted string case the problem vanishes. This leads me to
suspect the problem is somehow related to regex's handling of quotes.
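
One way to narrow down the quote handling is to print the character
class exactly as the regex engine receives it, after PHP's double-quote
escaping has been applied, and test it against a lone backslash. This
is a diagnostic sketch only; $class and $accepts are hypothetical names
and are not part of the reproduce code.

```php
<?php
// Diagnostic sketch; $class and $accepts are hypothetical names.
// The class is written the same way as in the reproduce code. After
// PHP's double-quote escaping the string contains: [^\"]
$class = "[^\\\"]";
echo $class . "\n";

// Inside a PCRE character class, \" is an escaped quote, so the class
// as written excludes only the quote character. Check whether a single
// backslash is accepted by it.
$accepts = preg_match('/^' . $class . '$/', '\\');
var_dump($accepts);
```

If the backslash is accepted here, the quoted-string branch of the
pattern can run past an escaped quote, which may be worth ruling out
before attributing the result to the engine.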

Even if there are problems with my regex (I am well aware it is not
optimal or particularly "good" in any sense; this is a cut-down test
case only), this example clearly demonstrates PHP's regex engine
matching a string that contains characters which are clearly excluded
by the pattern it matches.

I've tested this with one field and it doesn't appear to be a problem
there; it seems to affect only two fields one after another.

Reproduce code:
---------------
<?php

        $s = '"some text","test \",thing"';

        $r_text = "(\"(([^\\\"]|\\\\|\\\")*)\"|[^\",][^,]*)";
        
        $r_twofields = "${r_text},${r_text}";
        preg_match("/^${r_twofields}\$/", $s, $line);
        
        echo "<pre>";
        echo $s . "\n";
        echo $r_twofields . "\n";
        var_dump($line);
        echo "</pre>";

?>

Expected result:
----------------
"some text","test \",thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
  [0]=>
  string(27) ""some text","test \",thing""
  [1]=>
  string(20) ""some text","test \""
  [2]=>
  string(18) "some text","test \"
  [3]=>
  string(1) "\"
  [4]=>
  string(6) "thing""
}


Actual result:
--------------
"some text","test \", thing"
("(([^\"]|\\|\")*)"|[^",][^,]*),("(([^\"]|\\|\")*)"|[^",][^,]*)
array(5) {
  [0]=>
  string(28) ""some text","test \", thing""
  [1]=>
  string(20) ""some text","test \""
  [2]=>
  string(18) "some text","test \"
  [3]=>
  string(1) "\"
  [4]=>
  string(7) " thing""
}



------------------------------------------------------------------------

