[pcre-dev] [Bug 897] New: \w and others based on Unicode properties

Pavel Kostromitinov Mon, 19 Oct 2009 09:07:09 -0700

------- You are receiving this mail because: -------
You are on the CC list for the bug.


http://bugs.exim.org/show_bug.cgi?id=897
           Summary: \w and others based on Unicode properties
           Product: PCRE
           Version: N/A
          Platform: x86
        OS/Version: Windows
            Status: NEW
          Severity: wishlist
          Priority: medium
         Component: Code
        AssignedTo: [email protected]
        ReportedBy: [email protected]
                CC: [email protected]


A quote from the PCRE documentation:
---
The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
characters of any code value, but the characters that PCRE recognizes as
digits, spaces, or word characters remain the same set as before, all with
values less than 256. This remains true even when PCRE includes Unicode
property support, because to do otherwise would slow down PCRE in many common
cases. If you really want to test for a wider sense of, say, "digit", you must
use Unicode property tests such as \p{Nd}. Note that this also applies to \b,
because it is defined in terms of \w and \W.
---
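
To make the distinction in the quoted passage concrete, here is a small sketch. It uses Python's re module purely as a stand-in, not PCRE itself: the re.ASCII flag limits \w, \d and \b to ASCII, which is only roughly analogous to PCRE's fixed tables of characters below 256. The sample text is an assumption chosen for illustration.

```python
import re

# Cyrillic words, ASCII digits, then the Arabic-Indic digits U+0664 U+0662.
text = "Привет, мир 42 \u0664\u0662"

# With re.ASCII, \w, \d and \b only know about ASCII characters,
# similar in spirit to the narrow behaviour described in the quote.
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

# Without the flag, str patterns use Unicode-aware definitions,
# which is the behaviour this report asks for.
unicode_words = re.findall(r"\b\w+\b", text)
unicode_digits = re.findall(r"\d+", text)

print(ascii_words, ascii_digits)
print(unicode_words, unicode_digits)
```

In the narrow mode only "42" is found as a word or a number; in the Unicode-aware mode the Cyrillic words and the Arabic-Indic digits are found as well.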

I do appreciate the concern for speed in PCRE.

However, since I deal with international characters almost constantly, I
would really appreciate something like a build-time option (set when compiling
PCRE itself) that forces it to always use Unicode properties for these escapes.
I cannot simply replace every "\b" with a complex construction based on \p{},
because I do not write the patterns myself; end users do. Parsing their
patterns just to make the correct replacements does not appeal to me either.
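
For illustration, the pattern rewriting described above might look roughly like the following Python sketch. The function name widen_escapes and the replacement tables are hypothetical, and the \p{} expansions are my own assumptions about a "wider" meaning, not definitions taken from PCRE. The output is only useful with a PCRE build that has Unicode property support, used in UTF-8 mode.

```python
# Assumed "word character" set: any letter, any number, underscore.
WORD = r"\p{L}\p{N}_"

# Replacements outside a character class (hypothetical expansions).
OUTSIDE_CLASS = {
    "d": r"\p{Nd}",
    "D": r"\P{Nd}",
    "w": "[" + WORD + "]",
    "W": "[^" + WORD + "]",
    "b": "(?:(?<=[{0}])(?![{0}])|(?<![{0}])(?=[{0}]))".format(WORD),
    "B": "(?:(?<=[{0}])(?=[{0}])|(?<![{0}])(?![{0}]))".format(WORD),
}

# Inside [...] only some escapes can be widened: \W has no simple
# in-class equivalent, and \b means backspace there, so both stay as-is.
INSIDE_CLASS = {"d": r"\p{Nd}", "D": r"\P{Nd}", "w": WORD}


def widen_escapes(pattern):
    """Rewrite \\d, \\w, \\b and friends into \\p{}-based constructions."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Consume the escape as a pair so that "\\\\w" is not touched.
            nxt = pattern[i + 1]
            table = INSIDE_CLASS if in_class else OUTSIDE_CLASS
            out.append(table.get(nxt, ch + nxt))
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
            out.append(ch)
            i += 1
            # A ']' right after '[' or '[^' is a literal, not the class end.
            if i < len(pattern) and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < len(pattern) and pattern[i] == "]":
                out.append("]")
                i += 1
            continue
        elif ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)


print(widen_escapes(r"\bfoo\d+"))
print(widen_escapes(r"[\w-]+"))
```

Even this sketch ignores \s, \Q...\E quoting, POSIX classes such as [[:alpha:]], and comments in extended mode, which is part of why rewriting user-supplied patterns is unattractive.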

At the very least, I would greatly appreciate a hint on where I should look in
the PCRE sources to try to change this behaviour myself.


-- 
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev
