[pcre-dev] [Bug 1179] New: Missed utf8 caseless matches

Alex Coyte Thu, 24 Nov 2011 00:32:50 -0800

------- You are receiving this mail because: -------
You are on the CC list for the bug.


http://bugs.exim.org/show_bug.cgi?id=1179
           Summary: Missed utf8 caseless matches
           Product: PCRE
           Version: 8.20
          Platform: Other
        OS/Version: Linux
            Status: NEW
          Severity: bug
          Priority: low
         Component: Code
        AssignedTo: [email protected]
        ReportedBy: [email protected]
                CC: [email protected]


It appears that a small number of code points are matched caselessly only when
they are followed by at least one more byte in the subject string. I suspect
this is because the other-case form of the code point needs fewer bytes to
encode in UTF-8.

For example, LATIN SMALL LETTER A WITH STROKE (\x{2c65}) should match
caselessly against LATIN CAPITAL LETTER A WITH STROKE (\x{23a}). However, it
only seems to match when there is at least one more byte in the subject string:

PCRE version 8.20 2011-10-21

  re> /ⱥ/8i
------------------------------------------------------------------
  0   7 Bra
  3  /i \x{2c65}
  7   7 Ket
 10     End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: caseless utf8
No first char
No need char
data> ⱥ
 0: \x{2c65}
data> Ⱥ
No match
data> Ⱥ_
 0: \x{23a}
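
One way to picture the output above is a toy model in which the matcher first
checks that the subject has enough bytes left for the pattern character, and
only then tries the other case. The Python sketch below is a hypothetical
illustration, not PCRE's code; the names buggy_match and fixed_match are
invented for this example, and the flaw it models is only an assumption.

```python
# Toy model only: this is NOT PCRE source code. It illustrates how a
# length check based solely on the pattern character could miss a
# shorter other-case form at the very end of the subject.

def buggy_match(pat: str, other: str, subject: bytes, pos: int = 0):
    """Return the number of bytes matched at pos, or None."""
    p, o = pat.encode("utf-8"), other.encode("utf-8")
    # Assumed flaw: requires len(p) bytes even though `o` may be shorter.
    if len(subject) - pos < len(p):
        return None
    for cand in (p, o):
        if subject.startswith(cand, pos):
            return len(cand)
    return None

def fixed_match(pat: str, other: str, subject: bytes, pos: int = 0):
    """Same comparison, but each case form is limited only by its own length."""
    for cand in (pat.encode("utf-8"), other.encode("utf-8")):
        if subject.startswith(cand, pos):
            return len(cand)
    return None

small, capital = "\u2c65", "\u023a"  # 3 bytes and 2 bytes in UTF-8
print(buggy_match(small, capital, capital.encode("utf-8")))
print(buggy_match(small, capital, (capital + "_").encode("utf-8")))
print(fixed_match(small, capital, capital.encode("utf-8")))
```

In this model the first call finds no match, the second matches the two bytes
of \x{23a} once a trailing byte is present, and the third matches without
needing one, which mirrors the three data lines above.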

Interestingly, the match succeeds when the character is in a character class:

  re> /[ⱥ]/8i
------------------------------------------------------------------
  0  15 Bra
  3     [\x{2c65}\x{23a}]
 15  15 Ket
 18     End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: caseless utf8
No first char
No need char
data> ⱥ
 0: \x{2c65}
data> Ⱥ
 0: \x{23a}

This is happening on trunk (rev 765) as well as 8.20.

Other code points which seem to be affected include:
LATIN CAPITAL LETTER I WITH DOT ABOVE (\x{130})
LATIN SMALL LETTER DOTLESS I (\x{131})
LATIN CAPITAL LETTER SHARP S (\x{1e9e})
GREEK PROSGEGRAMMENI (\x{1fbe})
OHM SIGN (\x{2126})
KELVIN SIGN (\x{212a})
ANGSTROM SIGN (\x{212b})
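
The byte-length suspicion can be checked against this list with a short Python
sketch. The other-case partners below are an assumption: they are taken from
the Unicode simple case mappings and are not confirmed to be the exact pairs
PCRE uses internally. The helper names are invented for this example.

```python
# (code point from the report, ASSUMED other-case code point taken from
# the Unicode simple case mappings; not verified against PCRE's tables)
PAIRS = [
    (0x2C65, 0x023A),  # LATIN SMALL LETTER A WITH STROKE
    (0x0130, 0x0069),  # LATIN CAPITAL LETTER I WITH DOT ABOVE
    (0x0131, 0x0049),  # LATIN SMALL LETTER DOTLESS I
    (0x1E9E, 0x00DF),  # LATIN CAPITAL LETTER SHARP S
    (0x1FBE, 0x0399),  # GREEK PROSGEGRAMMENI
    (0x2126, 0x03C9),  # OHM SIGN
    (0x212A, 0x006B),  # KELVIN SIGN
    (0x212B, 0x00E5),  # ANGSTROM SIGN
]

def utf8_len(cp: int) -> int:
    """Number of bytes needed to encode one code point in UTF-8."""
    return len(chr(cp).encode("utf-8"))

def length_gaps(pairs=PAIRS):
    """Return (code point, other case, its length, other-case length) tuples."""
    return [(cp, other, utf8_len(cp), utf8_len(other)) for cp, other in pairs]

for cp, other, n, m in length_gaps():
    print(f"U+{cp:04X} ({n} bytes) <-> U+{other:04X} ({m} bytes)")
```

Under these assumed pairings, every listed code point needs more UTF-8 bytes
than its other-case partner.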


-- 
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
