http://bugs.exim.org/show_bug.cgi?id=1208

--- Comment #7 from Philip Hazel <[email protected]> 2012-02-10 12:23:41 ---
On Thu, 9 Feb 2012, Zoltan Herczeg wrote:

> We already have such opcode: \R which matches to (?:\r|\n|\r\n) although it
> cannot be used inside [] ranges.

... and also \R cannot be used in lookbehinds. Let us suppose somebody
writes a fragment of a pattern such as (?<=mass), which at the moment is
ok: it looks behind 4 characters. If we allow ss to match ß (a single
character), the lookbehind would have to step back either 3 or 4
characters, so we will either have to disallow ss in lookbehinds, or
completely re-implement the way they work. And of course the same thing
applies the other way round (ß in the pattern matching ss in the subject).

I see that Perl also disallows \R in lookbehinds. However, although it
does do some other special things, it is inconsistent. My version of Perl
(5.12.4) matches U+1f88 caselessly to U+1f80 (one character) or to U+1f00
U+03b9 (two characters), but if you include U+1f88 in a lookbehind, it
does not complain, but matches only the single character.

> If the number of such letters are relatively low (<5 for example), we
> might able to introduce special opcodes for them (Is there any other
> besides Eszett?).

If we start including accented characters, the list could get quite long.
I have not studied the Unicode document

  http://www.unicode.org/reports/tr18/

in great detail, but one of the things it says is this:

  At Level 1, caseless matches do not need to handle cases where one
  character matches against two.

I think that extending PCRE beyond level 1 would be a lot of work. It
would certainly affect performance (as the Unicode document does in fact
point out).

> The other opcodes would be implemented as character ranges. Similar to
> \h and \v but the list on the list is long and would require manual
> maintenance... Anyway if we decide to add multiple cases thing \h and
> \v should become part of that new system.

Yes, that is certainly a good point, depending on how this is implemented.
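To make the lookbehind problem above concrete, here is a small sketch in Python. It is not PCRE code: Python's `re` module is only a stand-in whose lookbehind rules differ from PCRE's in detail, and `str.casefold()` is used as a stand-in for full case folding. The helper names `fold_lengths` and `lookbehind_compiles` are made up for this illustration.

```python
import re

def fold_lengths(text):
    """Return (length as written, length after full case folding)."""
    return len(text), len(text.casefold())

def lookbehind_compiles(body):
    """True if Python's re accepts (?<=body) as a lookbehind."""
    try:
        re.compile("(?<=" + body + ")")
        return True
    except re.error:
        # Python's re rejects lookbehinds whose width is not fixed.
        return False

# "ß" (U+00DF) is one character, but it folds to the two characters "ss",
# so "maß" and "mass" are equal after folding yet differ in length.
same_after_folding = "ma\u00df".casefold() == "mass".casefold()

# A plain 4-character lookbehind is accepted ...
plain_ok = lookbehind_compiles("mass")
# ... but one that must step back 3 *or* 4 characters is rejected here.
mixed_ok = lookbehind_compiles("mass|ma\u00df")

# Python's caseless matching uses simple (one-to-one) case mapping, so
# "ß" in the subject does not match "ss" in the pattern.
caseless = re.fullmatch("mass", "ma\u00df", re.IGNORECASE)
```

The sketch only shows why a one-to-two character match conflicts with a lookbehind that steps back a known number of characters; it says nothing about how PCRE would or should implement it.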
> By the way, this would surely be unexpected for an average PCRE user:
> - "Μ" (uppercase mu) matches "μ" (lowercase mu);

And likewise for "ss" matching U+00DF even without case folding.

> Or uppercase mu has a different codepoint like different beta characters?

There is an upper case μ, it is Μ (U+039C). PCRE already handles this case
just fine.

I suppose that the first thing that might be done is to create a list of
all the special cases to see how many there are, before deciding what, if
anything, to do. However, I fear that the list will in fact turn out to be
rather large because of all the character+accent combinations such as the
U+1f88 example quoted above.

Ah. Silly me. No need to do this; the Unicode people have already done it:

  http://www.unicode.org/Public/6.1.0/ucd/SpecialCasing.txt

This file, which is not small, is concerned with case folding. It does not
specify that ß should match "ss", for example, just that, when casefolded,
it should match "SS".

The file is sufficiently large that it would need to be processed into
some kind of efficient indexed format before being used, and I would
certainly want as much of the work as possible to be done at compile time.
So yes, new opcodes would have to be invented. I would still be unhappy
about having to disallow ß in lookbehinds, even if only when case folding.
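As a rough illustration of "processed into some kind of indexed format", the sketch below parses lines in the SpecialCasing.txt layout (code; lower; title; upper; optional condition; # comment) into a lookup table keyed by code point. It is only a sketch in Python, not a proposal for PCRE's internal tables: the names `SpecialCase`, `SAMPLE` and `build_table` are invented here, the two sample lines stand in for the real file, and conditional entries are simply skipped.

```python
from typing import Dict, NamedTuple

class SpecialCase(NamedTuple):
    lower: str
    title: str
    upper: str

# Two sample lines in the file's layout; the real file is much longer.
SAMPLE = """\
# <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
"""

def _decode(field: str) -> str:
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())

def build_table(text: str) -> Dict[int, SpecialCase]:
    """Index the unconditional mappings by code point, once, up front."""
    table: Dict[int, SpecialCase] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if fields and fields[-1] == "":       # trailing ';' leaves an empty field
            fields.pop()
        if len(fields) != 4:
            continue                          # conditional (or malformed) entry
        code, lower, title, upper = fields
        table[int(code, 16)] = SpecialCase(
            _decode(lower), _decode(title), _decode(upper))
    return table

TABLE = build_table(SAMPLE)
```

With the sample data, `TABLE[0xDF].upper` is the two-character string "SS", which is the one-to-two mapping discussed above. The table is built once before any matching takes place, which is the spirit of doing the work at compile time.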
