http://bugs.exim.org/show_bug.cgi?id=1208

--- Comment #4 from Philip Hazel <[email protected]> 2012-02-09 16:58:33 ---

I agree with Zoltan that this is a difficult area, and I have kept well clear of it in the past. Consider this: you say

> For instance, "ß" (U+00DF LATIN SMALL LETTER SHARP S) should match "ss"

I take it that you mean a U+00DF in the pattern should match "ss" in the subject string. But should "ss" in the pattern also match U+00DF in the subject string? And what about a single accented character versus an unaccented character followed by an accent? This seems to be adding quite a lot of "semantics" to the business of character string matching.

As Zoltan says, PCRE uses relatively compact tables, and it implements only one-to-one case mappings. Changing this takes us to a whole new ballpark. I guess it would involve a second table of some kind, and it would play havoc with certain operations such as lookbehind, which rely on knowing, at compile time, how many characters to go back. Having to check another table for each character (even when not case folding) would no doubt slow things down somewhat.

Perhaps the solution might be to check the table at compile time and compile a different opcode for characters that require additional testing, though I am not really happy about this and I'm not sure how it would work for character pairs such as "ss". The feature could, of course, be optional, which would save everybody who is matching English wasting time checking for the German Eszett.

I don't know how Perl handles this. I guess one of us should check.
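
To make the distinction between one-to-one and multi-character case mappings concrete, here is a minimal Python sketch. It is not PCRE code and says nothing about how PCRE's tables are laid out. The helper names (`simple_fold`, `full_fold`, `lookbehind_compiles`) are hypothetical, `simple_fold` is only a rough approximation of one-to-one folding built on `str.lower()`, and Python's `re` module stands in for a regex engine with a fixed-width lookbehind rule.

```python
import re
import unicodedata

ESZETT = "\u00df"  # U+00DF LATIN SMALL LETTER SHARP S

def simple_fold(s):
    """Approximate one-to-one folding: every character maps to exactly
    one character, so the string length never changes.
    (Hypothetical helper; not the Unicode simple case folding table.)"""
    return "".join(c.lower() if len(c.lower()) == 1 else c for c in s)

def full_fold(s):
    """Full case folding via str.casefold(), which may expand a single
    character into several (U+00DF becomes "ss")."""
    return s.casefold()

def lookbehind_compiles(pattern):
    """Return True if Python's re accepts the pattern. Python's re rejects
    lookbehinds whose width is not known at compile time."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# One-to-one folding keeps the length; full folding can change it.
print(len(simple_fold(ESZETT)), len(full_fold(ESZETT)))

# With one-to-one folding, "straße" and "STRASSE" do not compare equal;
# with full folding they do, in either direction.
print(simple_fold("straße") == simple_fold("STRASSE"))
print(full_fold("straße") == full_fold("STRASSE"))

# The accent question: a precomposed character and base + combining accent
# are different code point sequences until they are normalized.
precomposed = "\u00e9"   # é as one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
print(precomposed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# Fixed-width lookbehind compiles; variable-width lookbehind does not.
print(lookbehind_compiles(r"(?<=ss)x"), lookbehind_compiles(r"(?<=s+)x"))
```

The length change under full folding is the reason a fixed-width lookbehind becomes awkward: once a single pattern character may correspond to two subject characters, the number of characters to step back is no longer known at compile time.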
