Re: [pcre-dev] \b bug with extended Unicode characters?

Ralf Junker Sun, 29 Mar 2009 03:22:58 -0700

Philip Hazel wrote:

>Did this ever get answered? The answer is that it is a limitation of
>PCRE. I have upgraded the documentation about \b to make it even
>clearer. It now says this:
>
>   In UTF-8 mode, characters with values greater than 128 never match
>   \d, \s, or \w, and always match \D, \S, and \W. This is true
>   even when Unicode character property support is available. These
>   sequences retain their original meanings from before UTF-8 support was
>   available, mainly for efficiency reasons. Note that this also affects
>   \b, because it is defined in terms of \w and \W.
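
To illustrate why \b is affected, here is a minimal sketch in Python rather than PCRE. It assumes that Python's re.ASCII flag is a fair stand-in for the behaviour described above, since it also restricts \w to [a-zA-Z0-9_]. The sample string is only an illustration.

```python
import re

# Assumption: re.ASCII in Python's "re" module stands in for the PCRE
# behaviour quoted above; this is an analogy, not PCRE itself.
text = "caf\u00e9"  # "café", with é as the single code point U+00E9

# \b matches at a position between a \w character and a \W character
# (or the start/end of the string).
ascii_boundaries = [m.start() for m in re.finditer(r"\b", text, re.ASCII)]
unicode_boundaries = [m.start() for m in re.finditer(r"\b", text)]

# With ASCII-only \w, "é" counts as \W, so a boundary appears inside the word.
print(ascii_boundaries)    # [0, 3]
# With Unicode-aware \w, the only boundaries are the two ends of the word.
print(unicode_boundaries)  # [0, 4]
```

The boundary reported at offset 3 in the first case is the kind of result that looks like a \b bug but follows directly from the definition of \w.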


Many thanks. I welcome this addition to (I suppose?) the PCRE pattern 
page. With it, I would not have had to look through the PCRE source 
code to answer my question.

Only later did I find that the PCRE help already documents the limitation:

6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly 
test characters of any code value, but the characters that PCRE 
recognizes as digits, spaces, or word characters remain the same set 
as before, all with values less than 256. This remains true even when 
PCRE includes Unicode property support, because to do otherwise would 
slow down PCRE in many common cases. If you really want to test for a 
wider sense of, say, "digit", you must use Unicode property tests 
such as \p{Nd}.

However, this paragraph is in pcre.html#utf8support, section "General 
comments about UTF-8 mode". I believe that most pattern writers will 
miss it there, so the extra mention in the pattern documentation 
is very much appreciated!
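
The difference between the legacy \d and a Unicode property test such as \p{Nd} can be sketched as follows. This is again Python rather than PCRE: Python's re module has no \p{...} syntax, so the sketch assumes that unicodedata.category(ch) == "Nd" is an acceptable substitute for \p{Nd}. The helper names are hypothetical.

```python
import re
import unicodedata


def is_ascii_digit(ch: str) -> bool:
    """Legacy-style \\d: only the ASCII digits 0-9 (re.ASCII is the stand-in)."""
    return re.fullmatch(r"\d", ch, re.ASCII) is not None


def is_unicode_digit(ch: str) -> bool:
    """Substitute for \\p{Nd}: general category 'Nd' (decimal digit number)."""
    return unicodedata.category(ch) == "Nd"


arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE

print(is_ascii_digit("7"), is_unicode_digit("7"))
print(is_ascii_digit(arabic_three), is_unicode_digit(arabic_three))
```

The ASCII digit passes both tests, while the Arabic-Indic digit is only recognized by the property-based test.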

Ralf 


-- 
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev
