Re: Unicode Categories

Tom Christiansen Wed, 10 Nov 2010 11:05:41 -0800

Patrick wrote at 12:15pm CST on Wednesday, 10 November 2010:

>> Sorry if this is the wrong forum. I was wondering if there was a way to
>> specify unicode
>> categories <http://www.fileformat.info/info/unicode/category/index.htm> in
>> a regular expression (and hence a grammar), or if there would be any
>> consideration for adding support for that (requiring some kind of special
>> syntax).


> Unicode categories are done using assertion syntax with "is" followed by
> the category name.  Thus <isLu> (uppercase letter), <isNd> (decimal digit), 
> <isZs> (space separator), etc.

> This even works in Rakudo today:

>    $ ./perl6
>    > say 'abcdEFG' ~~ / <isLu> /
>    E

> They can also be combined, as in +isLu+isLt  (uppercase+titlecase).
> The relevant section of the spec is in Synopsis 5; search for "Unicode
> properties are always available with a prefix".
 
> Hope this helps!

Actually, that quote from Synopsis raises more questions than it answers.

Below I've annotated the three output groups with (letters):

    % uniprops -a A
    U+0041 ‹A› \N{ LATIN CAPITAL LETTER A }:
 (A)    \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
 (B)    AHex ASCII_Hex_Digit All Any Alnum Alpha Alphabetic ASCII Assigned
            Cased Cased_Letter LC Changes_When_Casefolded CWCF
            Changes_When_Casemapped CWCM Changes_When_Lowercased CWL
            Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph
            GrBase Hex XDigit Hex_Digit ID_Continue IDC ID_Start IDS Letter L_
            Latin Latn Uppercase_Letter PerlWord PosixAlnum PosixAlpha
            PosixGraph PosixPrint PosixUpper Print Upper Uppercase Word
            XID_Continue XIDC XID_Start XIDS
 (C)    Age:1.1 Block=Basic_Latin Bidi_Class:L Bidi_Class=Left_To_Right
            Bidi_Class:Left_To_Right Bc=L Block:ASCII Block:Basic_Latin
            Blk=ASCII Canonical_Combining_Class:0
            Canonical_Combining_Class=Not_Reordered
            Canonical_Combining_Class:Not_Reordered Ccc=NR
            Canonical_Combining_Class:NR Decomposition_Type:None Dt=None
            East_Asian_Width:Na East_Asian_Width=Narrow East_Asian_Width:Narrow
            Ea=Na Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX
            Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
            Hangul_Syllable_Type=Not_Applicable
            Hangul_Syllable_Type:Not_Applicable Hst=NA
            Joining_Group:No_Joining_Group Jg=NoJoiningGroup
            Joining_Type:Non_Joining Jt=U Joining_Type:U
            Joining_Type=Non_Joining Script=Latin Line_Break:AL
            Line_Break=Alphabetic Line_Break:Alphabetic Lb=AL Numeric_Type:None
            Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1
            Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0
            Present_In:3.1 In=3.1 Present_In:3.2 In=3.2 Present_In:4.0 In=4.0
            Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1
            Present_In:5.2 In=5.2 Script:Latin Sc=Latn Script:Latn
            Sentence_Break:UP Sentence_Break=Upper Sentence_Break:Upper SB=UP
            Word_Break:ALetter WB=LE Word_Break:LE Word_Break=ALetter

What that means is that the "B" properties are properties from 
the *General* category.  They may all be referred to as \p{X} 
or \p{IsX}, \p{General_Category=X} or \p{General_Category:X}, 
and \p{GC=X} or \p{GC:X}.
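
To make the General_Category idea concrete outside of any regex engine, here
is a minimal sketch in Python rather than Perl, using only the standard
unicodedata module. The helper names (general_category, in_category,
first_match) are made up for illustration and are not part of Perl, Rakudo,
or uniprops. It covers only true General_Category values such as Lu or L,
not every name listed in group B.

```python
import unicodedata


def general_category(ch: str) -> str:
    """Return the two-letter General_Category value of one character."""
    return unicodedata.category(ch)


def in_category(ch: str, wanted: str) -> bool:
    """Hypothetical helper: True if ch is in category `wanted`.

    A one-letter value such as 'L' is treated as a major class and
    matches every subcategory that starts with it ('Lu', 'Ll', ...).
    Grouped aliases such as 'LC' are NOT handled by this sketch.
    """
    return general_category(ch).startswith(wanted)


def first_match(text: str, wanted: str):
    """Hypothetical helper: first character of text in category `wanted`."""
    for ch in text:
        if in_category(ch, wanted):
            return ch
    return None


if __name__ == "__main__":
    print(general_category("A"))          # Lu
    print(first_match("abcdEFG", "Lu"))   # E
```

The second call mirrors the quoted Rakudo session: the first uppercase
letter in 'abcdEFG' is E.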

I have a feeling that your synopsis quote is referring only to 
type B properties.  It says nothing about type C properties, 
which take a property=value form and must also be accounted for.
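
For a feel of what the type C properties look like in practice, here is a
rough Python sketch, again using only the standard unicodedata module, which
exposes just a handful of them. The names valued_properties and has_property
are hypothetical, chosen for this example. The point is only that each of
these is a property name paired with a value, so a bare name is not enough
to test for it.

```python
import unicodedata


def valued_properties(ch: str) -> dict:
    """Return a few name -> value properties for one character.

    Only the properties that Python's unicodedata exposes are listed;
    the keys follow the long names shown in the uniprops output.
    """
    return {
        "Bidi_Class": unicodedata.bidirectional(ch),
        "Canonical_Combining_Class": unicodedata.combining(ch),
        "East_Asian_Width": unicodedata.east_asian_width(ch),
    }


def has_property(ch: str, name: str, value) -> bool:
    """Hypothetical helper: emulate a Name=Value test on one character."""
    return valued_properties(ch).get(name) == value


if __name__ == "__main__":
    for name, value in valued_properties("A").items():
        print(f"{name}={value}")
```

For U+0041 the values agree with the listing above: Bidi_Class L,
Canonical_Combining_Class 0, and East_Asian_Width Na.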

--tom
