Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:
    # Perl 6
    / < <alpha> - [A-Za-z] >+ /     # All alphabetics except A-Z or a-z
                                    # (i.e. the accented alphabetics)

    [Update: Would now need to be <+<alpha> - [A..Za..z]> to avoid ambiguity with "Texas quotes", and because we want to reserve whitespace as the first character inside the angles for other uses.]

    Explicit character classes were deliberately made a little less convenient in Perl 6, because they're generally a bad idea in a Unicode world. For example, the [A-Za-z] character class in the above examples won't even match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham, Cherokee, or Klingon.

First off, that "i.e. the accented alphabetics" phrasing is quite incorrect! Code like

    /[^\P{Alpha}A-Za-z]/

matches not just things like

    00C1 LATIN CAPITAL LETTER A WITH ACUTE
    00C7 LATIN CAPITAL LETTER C WITH CEDILLA
    00C8 LATIN CAPITAL LETTER E WITH GRAVE
    00E5 LATIN SMALL LETTER A WITH RING ABOVE
    00F1 LATIN SMALL LETTER N WITH TILDE

but also of course:

    00AA FEMININE ORDINAL INDICATOR
    00B5 MICRO SIGN
    00BA MASCULINE ORDINAL INDICATOR
    00C6 LATIN CAPITAL LETTER AE
    00D0 LATIN CAPITAL LETTER ETH
    00DE LATIN CAPITAL LETTER THORN
    00DF LATIN SMALL LETTER SHARP S
    00E6 LATIN SMALL LETTER AE
    00F0 LATIN SMALL LETTER ETH
    01A6 LATIN LETTER YR
    01BA LATIN SMALL LETTER EZH WITH TAIL
    01BC LATIN CAPITAL LETTER TONE FIVE
    01BF LATIN LETTER WYNN
    02C7 CARON
    0391 GREEK CAPITAL LETTER ALPHA
    0410 CYRILLIC CAPITAL LETTER A

and many, many more.

I'm also disappointed to see perl6 spreading the notion that "accent" is somehow a valid synonym for

    diacritical marking
    diacritic marking
    diacritic mark
    diacritic
    mark

It's not. Accent is not a synonym for any of those. Not all marks are accents, and not all accents are marks.

I believe what is meant by "accent" is NFD($char) =~ /\pM/. Fine: then say "with diacritics", not "with accents".
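
To make the NFD($char) =~ /\pM/ test concrete, here is a minimal sketch. The helper name has_mark is made up for illustration, and the sample code points are just a few of the ones listed above plus plain "A".

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not from any module): true if the canonical
# decomposition of $char contains at least one combining mark (\pM).
sub has_mark {
    my ($char) = @_;
    return NFD($char) =~ /\pM/ ? 1 : 0;
}

binmode(STDOUT, ':encoding(UTF-8)');

# A WITH ACUTE and N WITH TILDE decompose to base letter + mark;
# AE, SHARP S and plain A have no canonical decomposition with a mark.
for my $cp (0x00C1, 0x00F1, 0x00C6, 0x00DF, 0x0041) {
    printf "U+%04X %s\n", $cp,
        has_mark(chr $cp) ? 'has a mark' : 'no mark';
}
```

Letters such as 00C6 and 00DF come out as "no mark" even though [^\P{Alpha}A-Za-z] matches them, which is the gap between "not A-Z" and "with diacritics".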
Also, there are many combining characters that aren't "accents" by any stretch of the term, such as 20E3 COMBINING ENCLOSING KEYCAP, to name just one. Only three code points have official names that include "ACCENT", and even these are dubious.

Finally, I note also that people use the Alpha property too loosely. Note the caron and such above. One probably wants the LC property instead.

--tom

    use strict;
    use warnings;
    use charnames ();
    use Unicode::Normalize;

    binmode(STDOUT, ":utf8");    # %c prints wide characters

    for my $cp ( 1 .. 0xffff ) {
        my $orig  = chr($cp);
        my $canon = NFD($orig);  # NFKD gives diff results
        ## Looser alternative test:
        ## if ($orig =~ /[^\P{Alpha}A-Za-z]/) {
        if ($orig =~ /\p{LC}/ && $canon !~ /^[A-Za-z]/) {
            printf("%c %04X %s\n", $cp, $cp, charnames::viacode($cp));
        }
    }
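
The Alpha versus LC point can also be checked on individual code points. This is only a sketch: classify is a hypothetical helper, and the three code points are picked for illustration (a cased letter, the caron mentioned above, and a digit).

```perl
use strict;
use warnings;

# Hypothetical helper: returns (is_Alpha, is_LC) as 1/0 flags
# for a single code point.
sub classify {
    my ($cp) = @_;
    my $char  = chr $cp;
    my $alpha = $char =~ /\p{Alpha}/ ? 1 : 0;
    my $cased = $char =~ /\p{LC}/    ? 1 : 0;
    return ($alpha, $cased);
}

# 00C1 is a cased letter; 02C7 CARON is a modifier letter (Alpha but
# not LC); 0031 DIGIT ONE is neither.
for my $cp (0x00C1, 0x02C7, 0x0031) {
    my ($alpha, $cased) = classify($cp);
    printf "U+%04X Alpha=%d LC=%d\n", $cp, $alpha, $cased;
}
```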