On Tue, 19 Jul 2011, Tom Hughes wrote:

> We were actually using \X as a convenient approximation for extended grapheme
> cluster ;-)

I've now read (slightly) more carefully the Unicode documentation about 
grapheme clusters. The raw definition is "a base followed by zero or 
more continuing characters", which is sort of what one expects. However, 
the definition of what the base and what the continuing characters are 
has been expanded to be a lot more complicated than an "extended Unicode 
sequence", which is what \X in PCRE matches. 

It seems that Perl has been changed to match the full Unicode
definition. I do not propose to change PCRE at this time, partly because
I want to get the next release out reasonably soon, and partly because I
think it needs careful thought as to what to do[*], and more research to
understand the Unicode definition properly.

When you created the test you posted, which is, in pcretest notation:

/^S(\X*)e(\X*)$/8
Stéréo

did you expect it to match or not to match? (In case people's displays 
mangle the subject string above, it consists of 8 Unicode characters,
encoded in 10 UTF-8 bytes. After each "e" there is the character U+0301
(acute accent), represented as the two hex bytes CC, 81.)
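
The counts above can be checked with a short Python sketch. It is not part of the original test; the variable name `subject` is only illustrative, and the string is written with explicit escapes so that it survives any mangling in transit.

```python
import unicodedata

# The subject string with explicit combining acute accents (U+0301)
# after each "e", as described in the message.
subject = "Ste\u0301re\u0301o"

encoded = subject.encode("utf-8")

print(len(subject))      # number of Unicode code points
print(len(encoded))      # number of UTF-8 bytes
print(encoded.hex(" "))  # raw bytes; U+0301 shows up as "cc 81"

# List each code point with its general category and name.
for ch in subject:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")
```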

PCRE (now that I've fixed the bug) does not match. The first \X* matches 
"t", the "e" is then matched, but the second \X* won't match a sequence 
starting U+0301 because that is not a base character. Perl, however, 
*does* match in this case. This could be argued to be because the 
Unicode document says that "Degenerate cases include any isolated 
non-base characters..." Hmm. Is that non-base character really 
"isolated"?
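
The non-match described above can be traced with a small Python sketch that emulates \X as the documented (?>\PM\pM*). This is an illustration under that assumption only, not PCRE's actual code; the helper names (`is_mark`, `match_x`, `x_star_ends`, `match_pattern`) are hypothetical.

```python
import unicodedata

def is_mark(ch):
    # \pM: general category Mn, Mc, or Me.
    return unicodedata.category(ch).startswith("M")

def match_x(s, i):
    # One \X taken as (?>\PM\pM*): a non-mark followed by all the marks
    # after it. Returns the end index, or None if no \X starts at i.
    if i >= len(s) or is_mark(s[i]):
        return None
    j = i + 1
    while j < len(s) and is_mark(s[j]):
        j += 1
    return j

def x_star_ends(s, i):
    # Every position where \X* starting at i may stop, longest first,
    # which is the order a greedy, backtracking matcher would try them.
    ends = [i]
    while (j := match_x(s, ends[-1])) is not None:
        ends.append(j)
    return ends[::-1]

def match_pattern(s):
    # Emulates /^S(\X*)e(\X*)$/ and returns the two captures, or None.
    if not s.startswith("S"):
        return None
    for a in x_star_ends(s, 1):
        if a < len(s) and s[a] == "e":
            for b in x_star_ends(s, a + 1):
                if b == len(s):
                    return s[1:a], s[a + 1:b]
    return None

print(match_pattern("Ste\u0301re\u0301o"))  # the subject string from the test
print(match_pattern("Stereo"))              # same word without the accents
```

In this emulation, whenever the literal "e" is matched the next character is U+0301, so the second \X* can only match the empty string and the final $ fails.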

This whole area is clearly a minefield. I imagine that if you converted 
your e-acute representation to a single character, Perl would no longer 
match it. On the other hand, PCRE would match /^Ste/ against that 
string, so it isn't simple/consistent/straightforward either.
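
The difference between the two representations can be sketched in Python, assuming NFC normalization as the way of converting e-acute to a single character. Python's code-point comparison stands in for the literal match here; it is not PCRE itself.

```python
import unicodedata

decomposed = "Ste\u0301re\u0301o"                     # "e" followed by U+0301
composed = unicodedata.normalize("NFC", decomposed)   # e-acute as one character

print(len(decomposed), len(composed))  # code point counts of the two forms

# A literal "Ste" prefix compares code points, so it is found in the
# decomposed form (the accent is simply the next character) but not in
# the composed form, where the third character is U+00E9.
print(decomposed.startswith("Ste"))
print(composed.startswith("Ste"))
```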

Anyway, as I said, I propose to leave \X alone in PCRE, at least for the 
moment. It is documented as matching (?>\PM\pM*) which is, at least to 
me, easier to understand than the whole complication of extended 
grapheme clusters. :-)

Philip

[*] At present, the code for handling \X in all the various situations 
(single, repeated-minimized, repeated-maximized, etc) is in-line. For 
anything more complicated I suspect a subroutine is needed, which is 
going to slow things down.

-- 
Philip Hazel
-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 
