Re: Perl6 and "accents"

Tom Christiansen Mon, 17 May 2010 17:55:46 -0700

> Why isn't that:

>  /<+ alpha - [A-Za-z]>+ /


If you're asking why it's mentioned in the "Update:" section
instead of the pattern in question just being rewritten, I don't know.

What got me most was the assumption that subtracting A-Za-z from Alphas
yielded "accented characters", as though Alpha meant A-Za-z.  It doesn't.

There are ***plenty*** of letters that aren't A-Z or a-z:  Latin letters
like ETH, THORN, WYNN, ESZETT, and others--plus all the letters from Greek,
Cyrillic, and the rest of the non-Latin scripts, too.  Also, within the
letters there are the non-casing \p{Lo} and \p{Lm} letters. 

The statement was more false than true.
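
To make that concrete, here is a minimal sketch in Python rather than Perl, using the standard unicodedata module as a stand-in for the \p{...} properties. The sample list and the describe() helper are made up for illustration. Every sample is a letter outside A-Za-z, and none of them carries a diacritic.

```python
import unicodedata

# Illustrative samples only: letters that are not in A-Z or a-z
# and that have no combining mark anywhere in their decomposition.
samples = [
    "\u00F0",  # LATIN SMALL LETTER ETH
    "\u00FE",  # LATIN SMALL LETTER THORN
    "\u01BF",  # LATIN LETTER WYNN
    "\u00DF",  # LATIN SMALL LETTER SHARP S (eszett)
    "\u03BB",  # GREEK SMALL LETTER LAMDA
    "\u0436",  # CYRILLIC SMALL LETTER ZHE
    "\u02B0",  # MODIFIER LETTER SMALL H (category Lm, no case)
    "\u05D0",  # HEBREW LETTER ALEF (category Lo, no case)
]

def describe(ch):
    """Return (general category, Unicode name) for one character."""
    return unicodedata.category(ch), unicodedata.name(ch)

def has_mark(ch):
    """True if the canonical decomposition of ch contains any Mark (M*)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return any(unicodedata.category(c).startswith("M") for c in decomposed)

for ch in samples:
    cat, name = describe(ch)
    print(f"U+{ord(ch):04X} {cat} marks={has_mark(ch)} {name}")
```

Subtracting A-Za-z from the alphabetics would still leave every one of these in the set, which is why the result cannot be described as "accented characters".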

>> I'm also disappointed to see perl6 spreading the notion that "accent"
>> is somehow a valid synonym for

>>    diacritical marking
>>    diacritic marking
>>    diacritic mark
>>    diacritic
>>    mark

>> It's not.  Accent is not a synonym for any of those.  Not all
>> marks are accents, and not all accents are marks.

> I agree that it's a rather "folksy" way of saying "them funny
> letters." On the other hand, I think that was the intent. It's
> very hard to find ways to describe Unicode spaces in ways that
> the average coder (not the average person, which is a small
> help) will grasp immediately. diacritical isn't a word that
> most folks know, even among programmers.

Certainly it's perfectly well known amongst people who deal with
letters--including those who work with the Unicode standard.

> "Accent" does have a colloquial meaning that maps correctly,
> but sadly that colloquial definition does not correspond to
> the technical definition, so in being clear, you become less
> accurate. There is, as far as I'm aware, no good middle
> ground, here.

One doesn't *have* to make up play-words.  There's nothing wrong with the
correct terminology.  Calling a mark a mark is pretty darned simple.

Unicode has blocks for diacritic marks, and a Diacritic property for
testing whether something is one.  There are 1328 code points whose
canonical decompositions have both \p{Diacritic} and \pM in them,
946 code points that have only \pM but not \p{Diacritic}, and 197 that 
have \p{Diacritic} but not \pM.

I still think resorting to talking about "accent marks" is a bad idea.  
I had somebody the other day thinking that "throwing out the accent marks"
meant deleting all characters whose code points were over 0x7F--and this
was a recent CompSci major, too.
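
The gap between those two readings is easy to show. This is a rough sketch in Python rather than Perl, with function names invented for illustration. It assumes "throwing out the marks" means decomposing canonically and dropping code points in the Mark categories, which is one common reading and not the only possible one.

```python
import unicodedata

def strip_marks(text):
    """Decompose canonically (NFD), drop Mark characters (Mn, Mc, Me),
    then recompose (NFC). Letters without marks are left alone."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        c for c in decomposed
        if not unicodedata.category(c).startswith("M")
    )
    return unicodedata.normalize("NFC", kept)

def delete_non_ascii(text):
    """The mistaken reading: delete every code point above 0x7F."""
    return "".join(c for c in text if ord(c) <= 0x7F)

# "creme brulee, strasse, thing" written with e-grave, u-circumflex,
# e-acute, sharp s, and thorn.
phrase = "cr\u00E8me br\u00FBl\u00E9e, stra\u00DFe, \u00FEing"

print(strip_marks(phrase))       # base letters survive; sharp s and thorn untouched
print(delete_non_ascii(phrase))  # whole letters vanish: "crme brle, strae, ing"
```

The first function removes only the marks and leaves the eszett and thorn alone, since neither has a mark to remove. The second deletes entire letters and mangles every word.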

But that's nothing.  The more you look into it, the weirder it can get,
especially with collation and canonical equivalence, both of which really
require locale knowledge outside the charset itself.
 
--tom
