[pcre-dev] [Bug 1416] UTF-8 lookbehinds match bytes instead of characters

Philip Hazel Wed, 20 Nov 2013 01:24:21 -0800

http://bugs.exim.org/show_bug.cgi?id=1416

--- Comment #4 from Philip Hazel <[email protected]>  2013-11-20 09:23:29 ---
On Tue, 19 Nov 2013, CJ Dennis wrote:

> @Graycode
> The equivalent English regex would be:
> <?php
> print(preg_replace('/(?<!k)/u', '*', 'k') . "\n"); // /u is unnecessary but harmless as no UTF-8 characters
> print(preg_replace('/(?<!k)/u', '*', 'm') . "\n");
> ?>

Thanks, Graycode, for analyzing this much better than I did. I should 
have saved the message in raw form to see what the characters actually 
were (my xterm screen on Linux showed just spaces). Now that I've done 
that, of course I see the actual bytes. 

> I'm assuming therefore that PCRE is behaving correctly (assuming no bug
> has crept in between 7.0 and 8.32).

Thanks for the test. I have now constructed my own test, both with 
actual UTF-8 bytes and using escapes, and checked that the lookbehind 
behaves correctly both with 8.33 and with the forthcoming 8.34. This is 
the output from 8.34-RC1:

PCRE version 8.34-RC1 2013-11-19

/(?<!(\x{D9A}))/g8
\x{D9A}
 0: 
\x{DB8}
 0: 
 0: 

/(?<!(ක))/g8
ක
 0: 
ම
 0: 
 0: 
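
For readers without pcretest to hand, the difference between a lookbehind
that steps over characters and one that steps over bytes can be sketched
in Python. This is only an analogy using Python's own re module, not PCRE,
and it is not part of the test above; the helper name match_offsets is
made up for the illustration. Matching on str corresponds roughly to the
/8 runs shown above, and matching on the UTF-8 encoded bytes shows what a
byte-oriented matcher would report instead.

```python
import re

KA = "\u0d9a"  # the character inside the lookbehind (\x{D9A} above)
MA = "\u0db8"  # a different character (\x{DB8} above)


def match_offsets(pattern, subject):
    """Return the start offset of every empty match (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, subject)]


# Character mode: pattern and subject are str, so offsets count characters.
char_pattern = "(?<!" + KA + ")"
chars_ka = match_offsets(char_pattern, KA)
chars_ma = match_offsets(char_pattern, MA)

# Byte mode: the same pattern and subjects, encoded as UTF-8 bytes.
# Each subject is three bytes long, so there are more candidate offsets.
byte_pattern = char_pattern.encode("utf-8")
bytes_ka = match_offsets(byte_pattern, KA.encode("utf-8"))
bytes_ma = match_offsets(byte_pattern, MA.encode("utf-8"))

print(chars_ka)  # [0]
print(chars_ma)  # [0, 1]
print(bytes_ka)  # [0, 1, 2]
print(bytes_ma)  # [0, 1, 2, 3]
```

The character-mode counts (one match for the first subject, two for the
second) agree with the pcretest output above. The byte-mode run also
matches at offsets inside the three-byte sequence, which is the kind of
result one would expect from a matcher that is not treating the subject
as UTF-8.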

Incidentally, if you ever need a Windows binary for pcretest again, note 
that a PCRE user has the latest releases of pcretest and pcregrep
available for download here:

http://www.rexegg.com/pcregrep-pcretest.html 

Philip
