------- You are receiving this mail because: ------- You are on the CC list for the bug.
http://bugs.exim.org/show_bug.cgi?id=1208 Giuseppe D'Angelo <[email protected]> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |[email protected] --- Comment #11 from Giuseppe D'Angelo <[email protected]> 2012-02-10 21:15:46 --- (In reply to comment #3) > To tell you the truth this issue has not been raised since I am here, so I > never estimated the required efforts. Is there any example implementation for > it (for UTF caseless compare)? The general rules of caseless string comparison are available here in section 3.13 (see f.i. Default Caseless Matching): http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf In short, it's a matter of casefolding the two strings, plus eventually rounds of normalizations. (In reply to comment #4) > I agree with Zoltan that this is a difficult area, and I have kept well clear > of it in the past. Consider this: you say > > For instance, "ß" (U+00DF LATIN SMALL LETTER SHARP S) should match "ss" > > I take it that you mean a U+00DF in the pattern should match "ss" in the > subject string. But, should "ss" in the pattern match U+00DF in the subject > string also? For the argument's sake, let's put special regexp symbols out of the equation: if ß matches "SS" case insensitively, then "SS" should match ß as well. In other words, it's logical to expect that both "straße" =~ /STRASSE/i and "STRASSE" =~ /straße/i match. > And what about a single accented character versus an unaccented character > followed by an accent? This seems to be adding quite a lot of "semantics" to > the business of character string matching. Indeed. That's point 2.1 here http://www.unicode.org/reports/tr18/#Extended_Unicode_Support > As Zoltan says, PCRE uses relatively compact tables, and it implements only > one-to-one case mappings. Changing this takes us to a whole new ballpark. I > guess it would involve a second table of some kind, and it would play havoc > with certain operations such as lookbehind, which rely on knowing, at compile > time, how many characters to go back. Having to check another table for each > character (even when not case folding) would no doubt slow things down > somewhat. I'm not sure it requires a lookup for each character. I'm no Unicode expert, but it seems to me that it's a kind of preprocessing to do on both the pattern and subject string in order to be able to do binary comparisons then. > Perhaps the solution might be to check the table at compile time and > compile a different opcode for characters that require additional testing, > though I am not really happy about this and I'm not sure how it would work for > character pairs such as "ss". The feature could, of course, be optional, which > would save everybody who is matching English wasting time checking for the > German Eszett. Indeed. > I don't know how Perl handles this. I guess one of us should check. It doesn't according to the perlunicode page (see RL2.1) http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level (In reply to comment #5) > We already have such opcode: \R which matches to (?:\r|\n|\r\n) although it > cannot be used inside [] ranges. If the number of such letters are relatively > low (<5 for example), we might able to introduce special opcodes for them (Is > there any other besides Eszett?). Basically the ones with "F" status listed here: http://www.unicode.org/Public/6.1.0/ucd/CaseFolding.txt > The other opcodes would be implemented as > character ranges. Similar to \h and \v but the list on the list is long and > would require manual maintenance... Anyway if we decide to add multiple cases > thing \h and \v should become part of that new system. > > By the way, this would surely be unexpected for an average PCRE user: > - "Μ" (uppercase mu) matches "μ" (lowercase mu); > > Or uppercase mu has a different codepoint like different beta characters? It has! That was Μ (U+039C, GREEK CAPITAL LETTER MU), and the fact it matches is no surprise. What I was pointing out was the inconsistency when doing case insensitive matching with PCRE 8.30: "µ" (U+00B5, MICRO SIGN) matches "Μ" (U+039C, GREEK CAPITAL LETTER MU) but "Μ" (U+039C, GREEK CAPITAL LETTER MU) DOES NOT match "µ" (U+00B5, MICRO SIGN) as well as: "µ" (U+00B5, MICRO SIGN) matches "Μ" (U+039C, GREEK CAPITAL LETTER MU) "μ" (U+03BC, GREEK SMALL LETTER MU) matches "Μ" (U+039C, GREEK CAPITAL LETTER MU) but "µ" (U+00B5, MICRO SIGN) DOES NOT match "μ" (U+03BC, GREEK SMALL LETTER MU) (and viceversa). (In reply to comment #6) > I'm not sure if this is relevant and how Unicode defines it, but Greek has two > variants of lower-case sigma (σ, ς) while there is only one upper-case Σ. If > I were a Greek I would like to match bot of them. It's the same situation here (again, testing PCRE 8.30): - "σ" matches "Σ" - "Σ" matches "σ" - "ς" matches "Σ" - "Σ" DOES NOT match "ς" (?) And: "σ" and "ς" do not match each other. This is solvable by case folding, although it opens many other implementation problems. For instance, does it make sense to have "ß" =~ /(S)(S)/i ? What should $1 and $2 contain then? I have the feeling this discussion is shifting towards PCRE implementing Unicode regexps Level 2/3. :-) Cheers, Giuseppe D'Angelo -- Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email -- ## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
