[perl #128550] [BUG] <[a..z]> ranges break grapheme awareness

via RT Tue, 05 Jul 2016 14:11:56 -0700

# New Ticket Created by  Zefram 
# Please include the string:  [perl #128550]
# in the subject line of all future correspondence about this issue. 
# <URL: https://rt.perl.org/Ticket/Display.html?id=128550 >



Built-in character classes such as <lower> consistently accept any
diacritics on a matching base character, matching the whole grapheme:

> /^<lower>$/.ACCEPTS("u\x[308]").Bool
True
> /^<lower>$/.ACCEPTS("n\x[308]").Bool
True

Matching against a literal character or a <[abc]>-type enumerated
character class consistently rejects any diacritics on a matching base
character:

> /^<[nu]>$/.ACCEPTS("u\x[308]").Bool
False
> /^<[nu]>$/.ACCEPTS("n\x[308]").Bool
False

But a <[a..z]>-type range-based character class has inconsistent
behaviour:

> /^<[a..z]>$/.ACCEPTS("u\x[308]").Bool
False
> /^<[a..z]>$/.ACCEPTS("n\x[308]").Bool
True

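The behaviour seems to be that the range is tested against the first
code point of the grapheme's NFC form.  "u\x[308]" composes in NFC to the
single code point U+00FC, which lies outside a..z, so it is rejected;
"n\x[308]" has no precomposed form, so its first code point is still the
unadorned "n" and it is accepted.  This dependence on the representation
breaks the grapheme view of the string, and so is presumably a bug.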
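
To make the representation difference visible, here is a minimal sketch
that prints the NFC code points behind the two test strings.  It only
inspects the strings and assumes nothing about how the regex engine
implements ranges; the loop variable $s is just a name chosen for
illustration.

```raku
# Sketch: show the grapheme count and NFC code points of each test string.
for "u\x[308]", "n\x[308]" -> $s {
    # .chars counts graphemes; .NFC yields the NFC code point sequence.
    say $s.chars, " grapheme(s), NFC code points: ",
        $s.NFC.list.map(*.fmt("U+%04X")).join(" ");
}
```

Both strings should report a single grapheme, but the first should show
one code point (U+00FC) and the second two (U+006E U+0308), which lines
up with the accept/reject split seen above.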

I think a <[a..z]>-type range should, with respect to diacritics, behave
either like <lower> or like <[abc]>.  I am unable to discern which
is really intended; none of the documentation that I've seen addresses
grapheme semantics.  I note that matching a specified base character with
arbitrary diacritics is a meaningful facility, and given that <lower>
et al have that behaviour it should probably be available somewhere.
The character range feature is almost providing it, but it's obviously
not been designed to, because a single-character range such as <[n..n]>
is rejected.

-zefram
