[pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead of characters

CJ Dennis Fri, 22 Nov 2013 00:21:39 -0800



http://bugs.exim.org/show_bug.cgi?id=1416

CJ Dennis <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |INVALID




--- Comment #3 from CJ Dennis <[email protected]>  2013-11-19 23:46:12 ---
@Graycode
The equivalent regex using English letters would be:
<?php
// /u is unnecessary but harmless here, as there are no UTF-8 characters
print(preg_replace('/(?<!k)/u', '*', 'k') . "\n");
print(preg_replace('/(?<!k)/u', '*', 'm') . "\n");
?>

--------
Actual output:
*k
*m*

@Philip Hazel
I could see the characters correctly in your reply. The regex looks for every
position that is not immediately preceded by the Sinhala letter for 'k' and
inserts a '*' there. The string to search in is the third argument (Sinhala 'k'
in the first call and Sinhala 'm' in the second). Yes, /u turns on UTF-8 mode;
otherwise PHP treats the pattern as ASCII (albeit with 8 bits).

You seem to be correct that it's not a PCRE bug. I downloaded pcre-7.0.exe (I'm
using Windows), read the pcre-man.pdf file and constructed a test file with the
regex and subject strings in it. I got one match for the first string and two
matches for the second. If the bug were present I would expect to see four
matches for the second string, not two (one at each byte offset of the
three-byte character). I'm assuming therefore that PCRE is behaving correctly
(assuming no bug has crept in between 7.0 and 8.32).
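
To illustrate the difference between those counts, here is a minimal sketch in
Python rather than PHP or pcretest. It uses Python's re module as a stand-in for
PCRE, so it only demonstrates character-wise versus byte-wise lookbehind and
says nothing about PCRE or PHP internals. The helper names (count_matches,
char_mode, byte_mode) are made up for this example, and the capture group from
the test pattern is dropped because it does not affect the number of matches.

```python
import re

KA = "\u0d9a"  # Sinhala letter 'k' (three bytes in UTF-8)
MA = "\u0db8"  # Sinhala letter 'm' (three bytes in UTF-8)


def count_matches(pattern, subject):
    """Count the (empty) matches of pattern in subject."""
    return len(re.findall(pattern, subject))


def char_mode(subject):
    """Match on characters: the lookbehind sees one whole letter."""
    return count_matches("(?<!" + KA + ")", subject)


def byte_mode(subject):
    """Match on raw UTF-8 bytes: every byte offset is a candidate position."""
    pattern = b"(?<!" + KA.encode("utf-8") + b")"
    return count_matches(pattern, subject.encode("utf-8"))


if __name__ == "__main__":
    for name, subject in (("k", KA), ("m", MA)):
        print(name, "characters:", char_mode(subject),
              "bytes:", byte_mode(subject))
```

Matching on characters gives 1 and 2 matches, the same counts as the pcretest
run. Matching on the raw bytes gives 3 and 4, because every byte offset inside
a three-byte letter becomes a candidate position.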

I will report this bug on the PHP site. Thanks for your time!

By the way the input file I used was:
--------
/(?<!(ක))/g8
ක
ම
--------

and the output was:
--------
PCRE version 7.0 18-Dec-2006

/(?<!(ක))/g8
ක
 0: 
ම
 0: 
 0: 

--------


-- 
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
