http://bugs.exim.org/show_bug.cgi?id=1416

           Summary: UTF-8 lookbehinds match bytes instead of characters
           Product: PCRE
           Version: 8.32
          Platform: Other
        OS/Version: Windows
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
        AssignedTo: [email protected]
        ReportedBy: [email protected]
                CC: [email protected]


I will use some PHP code with Sinhala UTF-8 characters (3 bytes each) to
illustrate the bug.

<?php
print(preg_replace('/(?<!ක)/u', '*', 'ක') . "\n"); // Works properly
print(preg_replace('/(?<!ක)/u', '*', 'ම') . "\n"); // Triggers the bug
?>

--------
Actual output:
*ක
*�*�*�*
--------
Expected output:
*ක
*ම*

It appears to check every byte position within the UTF-8 encoded character and
backtrack until a valid starting byte is found, then check it. If you remove
the improperly inserted characters, the resulting bytes do encode the original
character correctly.

Once it has matched the beginning of the string it should then move ahead by a
character, not by a byte.

It is possible this is a PHP bug and not a PCRE bug, but I have no way of
testing PCRE separately.
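To make the suspected mechanism concrete, here is a rough simulation written with plain PHP string functions, without calling PCRE at all. It is only a sketch of the hypothesis in this report, not PCRE's or PHP's actual code. The function names isContinuation() and markLookbehindMisses() are made up for illustration, and the pattern is assumed to be a negative lookbehind for a single character. The lookbehind backs up to the nearest lead byte before comparing. The only thing that varies is whether the candidate match positions advance by one byte or by one character.

```php
<?php
// Illustrative simulation only; this is not PCRE's or PHP's implementation.
// Requires PHP 7 or later for the scalar type declarations.

// True if the byte at offset $i is a UTF-8 continuation byte (10xxxxxx).
function isContinuation(string $s, int $i): bool
{
    return (ord($s[$i]) & 0xC0) === 0x80;
}

// Emulates preg_replace('/(?<!NEEDLE)/u', '*', $subject) for a one-character
// NEEDLE. When $byCharacter is false, every byte offset is tried as a match
// position (the suspected buggy behaviour); when true, offsets that fall in
// the middle of a character are skipped (the expected behaviour).
function markLookbehindMisses(string $subject, string $needle, bool $byCharacter): string
{
    $out = '';
    $len = strlen($subject);
    for ($pos = 0; $pos <= $len; $pos++) {
        $midCharacter = $pos < $len && isContinuation($subject, $pos);
        if (!($byCharacter && $midCharacter)) {
            // Back up from $pos to the lead byte of the preceding character.
            $start = $pos;
            while ($start > 0) {
                $start--;
                if (!isContinuation($subject, $start)) {
                    break;
                }
            }
            // The lookbehind matches if NEEDLE is found at that lead byte.
            $matches = $pos > 0
                && substr($subject, $start, strlen($needle)) === $needle;
            if (!$matches) {
                $out .= '*'; // negative lookbehind succeeded: empty match here
            }
        }
        if ($pos < $len) {
            $out .= $subject[$pos]; // copy the subject through byte by byte
        }
    }
    return $out;
}

$ka = "\xE0\xB6\x9A"; // ක (U+0D9A)
$ma = "\xE0\xB6\xB8"; // ම (U+0DB8)

// Hex dumps avoid depending on how the terminal renders invalid UTF-8.
echo bin2hex(markLookbehindMisses($ka, $ka, false)), "\n"; // byte stepping
echo bin2hex(markLookbehindMisses($ma, $ka, false)), "\n"; // byte stepping
echo bin2hex(markLookbehindMisses($ma, $ka, true)), "\n";  // character stepping
```

With byte stepping the second call places a '*' (2a in the hex dump) before each of the three bytes of ම and one after the last, which is the same shape as the actual output above. With character stepping the result is the expected *ම*. This agrees with the hypothesis but does not show whether the byte-wise advance happens inside PCRE or in the PHP code that calls it.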
