http://bugs.exim.org/show_bug.cgi?id=897

           Summary: \w and others based on Unicode properties
           Product: PCRE
           Version: N/A
          Platform: x86
        OS/Version: Windows
            Status: NEW
          Severity: wishlist
          Priority: medium
         Component: Code
        AssignedTo: [email protected]
        ReportedBy: [email protected]
                CC: [email protected]

A quote from the PCRE documentation:

---
The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test characters of any code value, but the characters that PCRE recognizes as digits, spaces, or word characters remain the same set as before, all with values less than 256. This remains true even when PCRE includes Unicode property support, because to do otherwise would slow down PCRE in many common cases. If you really want to test for a wider sense of, say, "digit", you must use Unicode property tests such as \p{Nd}. Note that this also applies to \b, because it is defined in terms of \w and \W.
---

I do appreciate the concern for speed in PCRE. However, since I have to deal with international characters almost constantly, I would really appreciate something like a build-time option (set when compiling PCRE itself) that forces these escapes to always use Unicode properties.

I cannot just replace every "\b" with a complex construction based on \p{}, since I don't write the patterns myself - end-users do. Parsing their patterns just to make the correct replacements doesn't look appealing to me either.

At the very least, I would greatly appreciate a hint on where I should look in the PCRE sources to try to change this behaviour myself.
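
To make the difference concrete, here is a small sketch of the two behaviours. It uses Python's `re` module purely as a stand-in, not PCRE itself: the `re.ASCII` flag restricts \w, \d and \b to ASCII (roughly analogous to the restricted sets described in the quote above), while Python's default for string patterns is Unicode-aware, which is the behaviour I am asking for. The sample text is made up for illustration.

```python
import re

# Hypothetical sample input containing non-ASCII letters and a non-ASCII digit.
text = "naïve café 123"
digits = "٣4"  # ARABIC-INDIC DIGIT THREE followed by ASCII 4

# Restricted classes: \w and \b only know ASCII letters, digits and underscore,
# so accented letters split words apart.
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)
ascii_digits = re.findall(r"\d", digits, flags=re.ASCII)

# Unicode-aware classes (Python's default for str patterns):
# \w, \d and \b follow Unicode character properties.
unicode_words = re.findall(r"\b\w+\b", text)
unicode_digits = re.findall(r"\d", digits)

print(ascii_words)     # ['na', 've', 'caf', '123']
print(unicode_words)   # ['naïve', 'café', '123']
print(ascii_digits)    # ['4']
print(unicode_digits)  # ['٣', '4']
```

The pattern is the same in both cases; only the flag changes. That is the point of the request: end-user patterns stay untouched and a single switch selects the wider character sets.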
