Perl6 and "accents"

Tom Christiansen Mon, 17 May 2010 10:53:01 -0700

Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:


  # Perl 6
  / < <alpha> - [A-Za-z] >+ /   # All alphabetics except A-Z or a-z
                                # (i.e. the accented alphabetics)

    [Update: Would now need to be <+<alpha> - [A..Za..z]> to avoid ambiguity
    with "Texas quotes", and because we want to reserve whitespace as the first
    character inside the angles for other uses.]

    Explicit character classes were deliberately made a little less convenient
    in Perl 6, because they're generally a bad idea in a Unicode world. For
    example, the [A-Za-z] character class in the above examples won't even
    match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone
    alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham,
    Cherokee, or Klingon.

First off, that "i.e. the accented alphabetics" phrasing is quite incorrect!  
Code like /[^\P{Alpha}A-Za-z]/ matches not just things like

    00C1 LATIN CAPITAL LETTER A WITH ACUTE
    00C7 LATIN CAPITAL LETTER C WITH CEDILLA
    00C8 LATIN CAPITAL LETTER E WITH GRAVE
    00E5 LATIN SMALL LETTER A WITH RING ABOVE
    00F1 LATIN SMALL LETTER N WITH TILDE

but also of course:

    00AA FEMININE ORDINAL INDICATOR
    00B5 MICRO SIGN
    00BA MASCULINE ORDINAL INDICATOR
    00C6 LATIN CAPITAL LETTER AE
    00D0 LATIN CAPITAL LETTER ETH
    00DE LATIN CAPITAL LETTER THORN
    00DF LATIN SMALL LETTER SHARP S
    00E6 LATIN SMALL LETTER AE
    00F0 LATIN SMALL LETTER ETH
    01A6 LATIN LETTER YR
    01BA LATIN SMALL LETTER EZH WITH TAIL
    01BC LATIN CAPITAL LETTER TONE FIVE
    01BF LATIN LETTER WYNN
    02C7 CARON
    0391 GREEK CAPITAL LETTER ALPHA
    0410 CYRILLIC CAPITAL LETTER A

and many, many more.
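
A quick spot check of a few code points from the two lists above, as a minimal sketch. The helper name `matches_class` is my own invention for illustration; it simply wraps the pattern quoted earlier.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: true if the code point is Alphabetic
# but is not one of the ASCII letters A-Z or a-z.
sub matches_class {
    my ($cp) = @_;
    return chr($cp) =~ /[^\P{Alpha}A-Za-z]/ ? 1 : 0;
}

# Print only names, not the characters themselves, so no output layer is needed.
for my $cp (0x41, 0xC1, 0xB5, 0xDE, 0x2C7, 0x391, 0x410) {
    printf "%04X %-36s %s\n", $cp, charnames::viacode($cp),
        matches_class($cp) ? "matches" : "no match";
}
```

Only the first entry, plain ASCII "A", should be rejected; the rest match whether or not they carry any diacritic.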

I'm also disappointed to see perl6 spreading the notion that "accent"
is somehow a valid synonym for 

    diacritical marking 
    diacritic marking 
    diacritic mark
    diacritic 
    mark

It's not.  Accent is not a synonym for any of those.  Not all marks are
accents, and not all accents are marks.

I believe what is meant by "accent" is NFD($char) =~ /\pM/.  Fine: then
say "with diacritics", not "with accents".    
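
That test can be sketched as a small function. The name `has_mark` is hypothetical; the only assumption is that "accented" really does mean "canonically decomposes to something containing a mark".

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true if the canonical decomposition of $char
# contains at least one character with general category M (mark).
sub has_mark {
    my ($char) = @_;
    return NFD($char) =~ /\pM/ ? 1 : 0;
}

# e-acute, n-tilde, ae, o-with-stroke, sharp s
for my $cp (0xE9, 0xF1, 0xE6, 0xF8, 0xDF) {
    printf "%04X %s\n", $cp, has_mark(chr $cp) ? "has a mark" : "no mark";
}
```

Note that 00F8 LATIN SMALL LETTER O WITH STROKE has no canonical decomposition, so this test reports no mark for it even though many people would call it "accented".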

Also, there are many combining characters that aren't "accents" by any
stretch of the term, such as 20E3 COMBINING ENCLOSING KEYCAP, to name just one.
Only three code points have official names that include "ACCENT", and even
these are dubious.
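
To see how broad \pM is, here is a rough sketch that reports which kind of mark a code point is. The helper `mark_kind` and the choice of 0903 DEVANAGARI SIGN VISARGA as a spacing-mark example are mine, added only for illustration.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: name the mark subcategory of a code point,
# or return the empty string if it is not a mark at all.
sub mark_kind {
    my ($cp) = @_;
    my $ch = chr $cp;
    return "nonspacing (Mn)" if $ch =~ /\p{Mn}/;
    return "spacing (Mc)"    if $ch =~ /\p{Mc}/;
    return "enclosing (Me)"  if $ch =~ /\p{Me}/;
    return "";
}

for my $cp (0x0301, 0x0903, 0x20E3) {
    printf "%04X %-28s %s\n", $cp, charnames::viacode($cp), mark_kind($cp);
}
```

All three satisfy \pM, but only the first is something one would ordinarily call an accent.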

Finally, I note also that people use the Alpha property too loosely.  Note
02C7 CARON in the list above: it is Alphabetic, yet it is a modifier letter,
not a cased letter.  One probably wants the LC property instead.
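
A short comparison of the two properties on a few of the code points listed earlier, as a sketch only. The helper name `classify` is hypothetical.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: return (is_alphabetic, is_cased_letter) as 1/0 flags.
sub classify {
    my ($cp) = @_;
    my $ch = chr $cp;
    return ( ($ch =~ /\p{Alpha}/ ? 1 : 0), ($ch =~ /\p{LC}/ ? 1 : 0) );
}

for my $cp (0xC1, 0xDE, 0x2C7) {
    my ($alpha, $lc) = classify($cp);
    printf "%04X %-36s Alpha=%d LC=%d\n",
        $cp, charnames::viacode($cp), $alpha, $lc;
}
```

The caron is the interesting row: Alphabetic, but not a cased letter.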

--tom

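    use strict;
    use warnings;
    use charnames ();
    use Unicode::Normalize;
    binmode(STDOUT, ":utf8");    # %c prints wide characters below
    for my $cp ( 1 .. 0xffff ) {
        my $orig  = chr($cp);
        my $canon = NFD($orig);  # NFKD gives different results
        ## if ($orig =~ /[^\P{Alpha}A-Za-z]/) {
        # cased letters whose canonical form does not start with A-Z or a-z
        if ($orig =~ /\p{LC}/ && $canon !~ /^[A-Za-z]/) {
            printf("%c %04X %s\n", $cp, $cp, charnames::viacode($cp));
        }
    }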
