 ID:               40395
 Updated by:       [EMAIL PROTECTED]
 Reported By:      jfrim at idirect dot com
 Status:           Assigned
-Bug Type:         PCRE related
+Bug Type:         Documentation problem
 Operating System: *
 PHP Version:      *
-Assigned To:      andrei
+Assigned To:      nlopess
 New Comment:

ok, so after talking with Andrei, we came up with the decision to
document it rather than changing the behaviour (e.g. because of bug
#5676).
BTW, probably you'll want to consider using preg_replace_callback().

note to self: need to review again the escaped chars (at least NULL,
single-quote and double-quote are)


Previous Comments:
------------------------------------------------------------------------

[2007-02-08 21:55:58] jfrim at idirect dot com

Another reason why it would be best to return NULL and DOUBLE-QUOTE
(0x00 and 0x22 respectively) in regular expression back-references
WITHOUT being escaped:

If this bug was fixed by escaping the backslash as well...

...The context of the resulting output string would be a mix of escaped
and non-escaped data. (Since the input string is non-escaped, but
back-references are escaped.) This would make it impossible to safely
un-escape without risk of data corruption. The only way to handle this
would be to use the "e" modifier in the regular expression and embed
stripslashes() into the replacement string. That's extra processing
overhead, and basically makes the entire preg_replace() function
useless without the "e" modifier. It also defeats any possible purpose
as to why the back-references are escaped in the first place. Boo to
this solution!

Alternatively, if this bug was fixed by returning NULL and DOUBLE-QUOTE
without being escaped...

When using preg_replace(), the resulting string will always be in a
non-encoded context. If a slash-encoded string is ever desired, the
entire thing can be wrapped in addslashes() by the user, without ever
risking destroying the integrity of the data.

------------------------------------------------------------------------

[2007-02-08 19:59:04] jfrim at idirect dot com

The following code demonstrates 0x00 and 0x22 being escaped, without
0x5C being escaped.
It creates an 8-bit ASCII text output, with the character value (in
DECIMAL) enclosed within braces (except for escaped chars, in which
case it ends up as "92"), followed by the actual character, then a
CRLF, for all 256 characters.

Note how the backslash (0x5C, decimal 92) is NOT escaped, and contrary
to what [EMAIL PROTECTED] posted, the single-quote (0x27, decimal 39) is
NOT escaped either. (The double-quote (0x22, decimal 34) is escaped
instead.)

<?php
header('Content-Type: text/plain; charset=US-ASCII');
header('Content-Disposition: inline; filename=PCRE.txt');
header('Pragma: no-cache');
header('Expires: 0');
header('Cache-Control: no-cache; must-revalidate');

$teststring='';
for ($i=0; $i<=255; $i++) {
  $teststring.=chr($i);
}

echo preg_replace('/([\\x00-\\xFF])/e',"'{'.ord('\\1').'}\\1'.chr(13).chr(10)",$teststring);
?>

------------------------------------------------------------------------

[2007-02-08 19:47:10] jfrim at idirect dot com

I have verified that along with 0x00 being escaped, 0x22 (the
double-quote character) is also escaped. No other byte values are
affected.

Even if the documentation was changed to reflect this escaped behaviour
of 0x00 and 0x22, there would still be a bug with this behaviour since
0x5C (the backslash character) is NOT escaped! This would create a
discrepancy problem if the input string to a preg_replace() contained a
literal backslash followed by a number zero, or a backslash followed by
a double-quote. There would be no way to tell from the resulting
preg_replace'd data if those sequences are escaped NULLs and escaped
double-quotes, or if those were literal sequences in the input string.

So the only way to fix this bug is to either...

...A: Escape the backslash as well, and change the documentation to
state that 0x00, 0x22, and 0x5C are escaped, or...

...B: Do not escape any characters.

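The ambiguity described above can be sketched in a few lines. This is only an illustration: escape_like_e_backref() is a hypothetical helper that imitates the escaping reported in this thread (NUL and double-quote escaped, backslash left alone); it is not PHP's internal code, and it avoids the "e" modifier so that it also runs on PHP versions that no longer support it. The second half shows preg_replace_callback(), suggested elsewhere in this report, receiving the matched bytes unmodified.

```php
<?php
// Hypothetical helper (an assumption for illustration): imitates the
// escaping reported for "e"-modifier back-references, i.e. NUL and
// double-quote get a leading backslash while the backslash itself is
// left untouched.
function escape_like_e_backref(string $s): string
{
    return strtr($s, ["\0" => '\\0', '"' => '\\"']);
}

$literal = '\\0'; // two bytes: backslash (0x5C) followed by the digit zero
$nul     = "\0";  // one byte: NUL (0x00)

$a = escape_like_e_backref($literal);
$b = escape_like_e_backref($nul);

// Two different inputs produce the same escaped output, so the
// escaping cannot be reversed without guessing.
var_dump($a === $b);

// With preg_replace_callback() the callback receives the raw match,
// so NUL, double-quote and backslash all arrive unescaped.
$out = preg_replace_callback(
    '/[\x00"\\\\]/',
    function (array $m): string {
        return '{' . ord($m[0]) . '}';
    },
    "a\0b\"c\\d"
);

echo $out, "\n"; // a{0}b{34}c{92}d
```

Because the helper leaves 0x5C alone, a literal backslash-zero and an escaped NUL become indistinguishable, which is the case for either option A or option B rather than the mixed behaviour.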
I would say method B is preferred, since no stripslashes() would have
to be performed on the resulting output from a preg_replace(), and it's
far more intuitive to always know that a regular expression
back-reference will always contain the exact byte value that was
matched, without having to worry about special exceptions.

------------------------------------------------------------------------

[2007-02-08 13:17:59] [EMAIL PROTECTED]

Ok, so the problem here is that preg_do_eval() calls
php_addslashes_ex(), that escapes "'", "\" and "\0".
So we should either not escape the \0 or reflect the behaviour in the
docs.
Assigning to the extension maintainer.

------------------------------------------------------------------------

[2007-02-08 06:01:32] jfrim at idirect dot com

I'd also like to present bug #16590:
http://bugs.php.net/bug.php?id=16590

Note the following example they list as a SOLUTION to specifying NULLs
in the pattern:
preg_match("/\\x00/", "foo\0bar")

And note the following statement from bug report #16590:
"...The docs state that PCRE is binary safe..."

So if PCRE is binary safe, and you can specify NULLs in the pattern
with \x00, why are back references unable to return these matched
NULLs?!?!? How is this NOT a bug?!??

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/40395

-- 
Edit this bug report at http://bugs.php.net/?id=40395&edit=1