http://bugs.exim.org/show_bug.cgi?id=1100

--- Comment #10 from Philip Hazel <[email protected]> 2011-07-20 10:02:48 ---

On Tue, 19 Jul 2011, Tom Hughes wrote:

> We were actually using \X as a convenient approximation for extended grapheme
> cluster ;-)

I've now read (slightly) more carefully the Unicode documentation about
grapheme clusters. The raw definition is "a base followed by zero or more
continuing characters", which is sort of what one expects. However, the
definition of what the base and what the continuing characters are has
been expanded to be a lot more complicated than an "extended Unicode
sequence", which is what \X in PCRE matches. It seems that Perl has been
changed to match the full Unicode definition.

I do not propose to change PCRE at this time, partly because I want to
get the next release out reasonably soon, and partly because I think it
needs careful thought as to what to do[*], and more research to
understand the Unicode definition properly.

When you created the test you posted, which is, in pcretest notation:

  /^S(\X*)e(\X*)$/8
    Stéréo

did you expect it to match or not to match? (In case people's displays
mangle the subject string above, it consists of 8 Unicode characters,
encoded in 10 UTF-8 bytes. After each "e" there is the character U+0301
(acute accent), represented as the two hex bytes CC, 81.)

PCRE (now that I've fixed the bug) does not match. The first \X* matches
"t", the "e" is then matched, but the second \X* won't match a sequence
starting U+0301 because that is not a base character.

Perl, however, *does* match in this case. This could be argued to be
because the Unicode document says that "Degenerate cases include any
isolated non-base characters..." Hmm. Is that non-base character really
"isolated"? This whole area is clearly a minefield. I imagine that if you
converted your e-acute representation to a single character, Perl would
no longer match it.
On the other hand, PCRE would match /^Ste/ against that string, so it
isn't simple/consistent/straightforward either.

Anyway, as I said, I propose to leave \X alone in PCRE, at least for the
moment. It is documented as matching (?>\PM\pM*) which is, at least to
me, easier to understand than the whole complication of extended grapheme
clusters. :-)

Philip

[*] At present, the code for handling \X in all the various situations
(single, repeated-minimized, repeated-maximized, etc) is in-line. For
anything more complicated I suspect a subroutine is needed, which is
going to slow things down.

--
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
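
The match attempt described in the message can be traced with a rough
sketch in Python. It assumes \X means exactly the documented
(?>\PM\pM*), i.e. one non-mark character followed by any number of
characters in general category M. It does not call PCRE or Perl; the
function names (is_mark, match_x, x_star_ends, matches_test_pattern) are
made up for illustration, and the backtracking is hand-written for this
one pattern only.

```python
import unicodedata


def is_mark(ch):
    r"""True if ch is in Unicode general category M (Mn, Mc, Me), i.e. \pM."""
    return unicodedata.category(ch).startswith("M")


def match_x(s, i):
    r"""Emulate one (?>\PM\pM*) at index i; return the end index, or None."""
    if i >= len(s) or is_mark(s[i]):
        return None  # \PM needs a non-mark character to start the sequence
    j = i + 1
    while j < len(s) and is_mark(s[j]):
        j += 1  # absorb the trailing marks (\pM*)
    return j


def x_star_ends(s, i):
    r"""End positions reachable by \X* from i, longest first (greedy order)."""
    ends = [i]  # zero repetitions is always allowed
    while True:
        j = match_x(s, ends[-1])
        if j is None:
            break
        ends.append(j)
    return ends[::-1]


def matches_test_pattern(s):
    r"""Emulate ^S(\X*)e(\X*)$ by trying each end of the first \X* in turn."""
    if not s.startswith("S"):
        return False
    for p in x_star_ends(s, 1):
        if p < len(s) and s[p] == "e":
            # The second \X* must be able to reach the end of the subject.
            if len(s) in x_star_ends(s, p + 1):
                return True
    return False


decomposed = "Ste\u0301re\u0301o"  # each "e" is followed by U+0301
print(len(decomposed), len(decomposed.encode("utf-8")))  # 8 10
print(matches_test_pattern(decomposed))                  # False
print(decomposed.startswith("Ste"))                      # True, like /^Ste/
```

Under this assumed definition the second \X* can never start at U+0301,
so both places where a literal "e" could be matched lead to failure. The
last line reflects the inconsistency noted above: the plain prefix "Ste"
still matches, because a literal "e" pays no attention to the mark that
follows it.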
