https://bugs.exim.org/show_bug.cgi?id=1866

            Bug ID: 1866
           Summary: UTF-8 class containing \D and \P{Nd} matches
                    incorrectly
           Product: PCRE
           Version: 8.39
          Hardware: x86
                OS: All
            Status: NEW
          Severity: bug
          Priority: medium
         Component: Code
          Assignee: p...@hermes.cam.ac.uk
          Reporter: justin.vii...@intel.com
                CC: pcre-dev@exim.org

We found an issue with this pattern, which has the PCRE_UTF8 flag set but not
PCRE_UCP:

    /[\D\P{Nd}]/8

pcretest -d shows this class being interpreted as the union of "all non-digit
characters up to \xff" and "all characters not in \p{Nd}":

$ ./pcretest -d
PCRE version 8.39 2016-06-14

  re> /[\D\P{Nd}]/8
------------------------------------------------------------------
  0  43 Bra
  3     [\x00-/:-\xff\P{Nd}]
 43  43 Ket
 46     End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf
No first char
No need char
data> 0
No match
data> _
 0: _
data> \x{1d7cf}
No match

However, my reading of the pcrepattern documentation suggests that without the
PCRE_UCP flag, \D should be interpreted as "all non-digit characters up to \xff
and *all other characters*", meaning that the last test case above, U+1D7CF
(mathematical bold digit one), should match.
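
To make the expected semantics concrete, here is a minimal sketch in Python (a
model of the two readings of the class, not PCRE code). The function names are
hypothetical and exist only for illustration. It assumes that without PCRE_UCP
\d matches only the ASCII digits 0-9, and it uses Python's unicodedata module
as a stand-in for the \p{Nd} property lookup:

```python
import unicodedata


def is_ascii_digit(cp: int) -> bool:
    """True for the code points of '0'..'9'."""
    return 0x30 <= cp <= 0x39


def not_nd(cp: int) -> bool:
    """Model of \\P{Nd}: the code point is not a decimal number."""
    return unicodedata.category(chr(cp)) != "Nd"


def expected_class(cp: int) -> bool:
    """[\\D\\P{Nd}] as I read the docs: without UCP, \\D is every code
    point except the ASCII digits, including everything above 0xff."""
    return (not is_ascii_digit(cp)) or not_nd(cp)


def observed_class(cp: int) -> bool:
    """[\\x00-/:-\\xff\\P{Nd}] as printed by pcretest -d: the \\D part
    stops at 0xff, so higher code points depend on \\P{Nd} alone."""
    in_low_range = cp <= 0xFF and not is_ascii_digit(cp)
    return in_low_range or not_nd(cp)


if __name__ == "__main__":
    for cp in (ord("0"), ord("_"), 0x1D7CF):
        print(f"U+{cp:04X} expected={expected_class(cp)} "
              f"observed={observed_class(cp)}")
```

The two models agree on "0" and "_" and differ only on code points above \xff
that are in \p{Nd}, such as U+1D7CF, which is exactly the failing test case.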

This is the behaviour we see if we use /\D/8 on its own, or if we transform
the pattern above into an alternation:

  re> /\D|\P{Nd}/8
data> 0
No match
data> a
 0: a
data> \x{1d7cf}
 0: \x{1d7cf}

I have checked both PCRE 8.39 and PCRE 10.22 and they both show the same
behaviour.
