[pcre-dev] [Bug 1208] Case folding in PCRE

Philip Hazel Fri, 10 Feb 2012 04:24:05 -0800

http://bugs.exim.org/show_bug.cgi?id=1208

--- Comment #7 from Philip Hazel <[email protected]>  2012-02-10 12:23:41 ---
On Thu, 9 Feb 2012, Zoltan Herczeg wrote:

> We already have such opcode: \R which matches to (?:\r|\n|\r\n) although it
> cannot be used inside [] ranges. 

... and also \R cannot be used in lookbehinds. Let us suppose somebody 
writes a fragment of a pattern such as (?<=mass) which at the moment is 
ok - it looks behind 4 characters. If we allow ss to match ß (a single 
character) we will either have to disallow ss in lookbehinds, or 
completely re-implement the way they work. And of course the same thing 
the other way round (ß in pattern matching ss in the subject).
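
The length mismatch behind this concern can be illustrated with a small
Python sketch. It relies on Python's built-in str.casefold() (full Unicode
case folding) and on Python's re module, which also requires fixed-width
lookbehinds. This is only an illustration of the problem, not a statement
about how PCRE would or should implement it.

```python
# Illustration only: full case folding can change the length of a string,
# which is what a fixed-width lookbehind cannot cope with.
import re

word_ss = "mass"           # 4 characters
word_eszett = "ma\u00df"   # "maß", 3 characters

# Under full case folding the two spellings compare equal...
same = word_ss.casefold() == word_eszett.casefold()

# ...but they do not have the same number of characters.
lengths = (len(word_ss), len(word_eszett))

# (?<=mass) steps back exactly 4 characters before trying to match.
lookbehind = re.compile(r"(?<=mass)X", re.IGNORECASE)
hit_ss = lookbehind.search("massX") is not None
hit_eszett = lookbehind.search("ma\u00dfX") is not None

print(same, lengths, hit_ss, hit_eszett)
```

For the lookbehind to succeed on the second subject it would have to step
back 3 characters instead of 4, so the step distance would no longer be a
property of the pattern alone.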

I see that Perl also disallows \R in lookbehinds. However, although 
it does do some other special things, it is inconsistent. My version of
Perl 5.12.4 matches U+1f88 caselessly to U+1f80 (one character) or to
U+1f00 U+03b9 (two characters), but if you include U+1f88 in a
lookbehind, it does not complain, but matches only the single character.
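
The two alternatives for U+1f88 can be reproduced with Python's string
methods, as a rough sketch. It assumes that str.lower() yields the
one-character mapping and that str.casefold() applies full case folding,
which is how current Python versions behave; Perl and PCRE internals are
not modelled here.

```python
# Illustration only: one code point with both a one-character and a
# two-character caseless counterpart.
upper = "\u1f88"        # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
single = "\u1f80"       # the one-character lower-case form
pair = "\u1f00\u03b9"   # alpha with psili followed by iota, two characters

one_to_one = upper.lower()      # lower-case mapping, stays one character
one_to_many = upper.casefold()  # full case folding, expands to two characters

print(len(one_to_one), len(one_to_many))
```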

> If the number of such letters are relatively low (<5 for example), we
> might able to introduce special opcodes for them (Is there any other
> besides Eszett?).

If we start including accented characters, the list could get quite 
long. I have not studied the Unicode document

http://www.unicode.org/reports/tr18/

in great detail, but one of the things it says is this:

  At Level 1, caseless matches do not need to handle cases where one 
  character matches against two.

I think that extending PCRE beyond Level 1 would be a lot of work. It 
would certainly affect performance (as the Unicode document does in fact 
point out).
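
As a rough sketch of the difference, the following Python compares a
one-to-one caseless comparison with a full case-folding comparison. The
helper names are hypothetical, and simple_fold_char() is only an
approximation of simple case folding built from str.lower(); it is not the
table that PCRE uses.

```python
# Illustration only: one-to-one ("Level 1" style) caseless comparison
# versus comparison under full case folding.

def simple_fold_char(ch):
    """Approximate a one-to-one fold: keep the character if lowering expands it."""
    lowered = ch.lower()
    return lowered if len(lowered) == 1 else ch

def level1_equal(a, b):
    """Character-by-character comparison; one character never matches two."""
    if len(a) != len(b):
        return False
    return all(simple_fold_char(x) == simple_fold_char(y) for x, y in zip(a, b))

def full_equal(a, b):
    """Comparison under full Unicode case folding; lengths may differ."""
    return a.casefold() == b.casefold()

print(level1_equal("\u039c", "\u03bc"))  # upper-case mu against lower-case mu
print(level1_equal("\u00df", "ss"))      # eszett against "ss", one-to-one
print(full_equal("\u00df", "ss"))        # eszett against "ss", full folding
```

The one-to-one version can reject on length before looking at any
character, which is part of why it is cheap; the full version cannot.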

> The other opcodes would be implemented as character ranges. Similar to
> \h and \v but the list on the list is long and would require manual
> maintenance... Anyway if we decide to add multiple cases thing \h and
> \v should become part of that new system.

Yes, that is certainly a good point, depending on how this is 
implemented.

> By the way, this would surely be unexpected for an average PCRE user:
> - "Μ" (uppercase mu) matches "μ" (lowercase mu);

And likewise for "ss" matching U+00DF even without case folding.

> Or uppercase mu has a different codepoint like different beta characters?

There is an upper case μ, it is Μ (U+039C). PCRE already handles this 
case just fine.

I suppose that the first thing that might be done is to create a list of 
all the special cases to see how many there are, before deciding what, 
if anything, to do. However, I fear that the list will in fact turn out 
to be rather large because of all the character+accent combinations such 
as the U+1f88 example quoted above. Ah. Silly me. No need to do this; 
the Unicode people have already done it:

http://www.unicode.org/Public/6.1.0/ucd/SpecialCasing.txt

This file, which is not small, is concerned with case folding. It does
not specify that ß should match "ss", for example, just that, when 
casefolded, it should match "SS". The file is sufficiently large that it
would need to be processed into some kind of efficient indexed format
before being used, and I would certainly want as much of the work as
possible to be done at compile time. So yes, new opcodes would have to
be invented.
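
To give an idea of the preprocessing involved, here is a minimal Python
sketch that turns lines in the SpecialCasing.txt layout into a table keyed
by code point. The function name is hypothetical and the two sample lines
are written out here as an assumption about the file's contents, so the
real file should be consulted. Entries with a condition list are skipped
for simplicity.

```python
# Illustration only: build an indexed table from SpecialCasing-style lines.
# Assumed field layout:
#   <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
SAMPLE = """\
# Sample lines, assumed to follow the layout above
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
1F88; 1F80; 1F88; 1F08 0399; # GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
"""

def parse_special_casing(text):
    """Return {code point: (lower, title, upper)} for unconditional entries."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        code, lower, title, upper = fields[:4]
        conditions = fields[4] if len(fields) > 4 else ""
        if conditions:
            continue                          # context- or locale-dependent
        table[int(code, 16)] = tuple(
            "".join(chr(int(cp, 16)) for cp in field.split())
            for field in (lower, title, upper)
        )
    return table

table = parse_special_casing(SAMPLE)
print(table[0x00DF])
```

A dictionary is only a stand-in here; a compiled-in sorted array or a
two-stage table would be the more likely shape for a C library.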

I would still be unhappy about having to disallow ß in lookbehinds, 
even if only when case folding.
