Patrick wrote at 12:15pm CST on Wednesday, 10 November 2010:

>> Sorry if this is the wrong forum. I was wondering if there was a way to
>> specify unicode categories
>> <http://www.fileformat.info/info/unicode/category/index.htm> in
>> a regular expression (and hence a grammar), or if there would be any
>> consideration for adding support for that (requiring some kind of special
>> syntax).
> Unicode categories are done using assertion syntax with "is" followed by
> the category name. Thus <isLu> (uppercase letter), <isNd> (decimal digit),
> <isZs> (space separator), etc.
>
> This even works in Rakudo today:
>
>     $ ./perl6
>     > say 'abcdEFG' ~~ / <isLu> /
>     E
>
> They can also be combined, as in +isLu+isLt (uppercase+titlecase).
>
> The relevant section of the spec is in Synopsis 5; search for "Unicode
> properties are always available with a prefix".
>
> Hope this helps!

Actually, that quote from Synopsis raises more questions than it answers.
Below I've annotated the three output groups with (letters):

    % uniprops -a A
    U+0041 ‹A› \N{ LATIN CAPITAL LETTER A }:
    (A) \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
    (B) AHex ASCII_Hex_Digit All Any Alnum Alpha Alphabetic ASCII Assigned
        Cased Cased_Letter LC Changes_When_Casefolded CWCF
        Changes_When_Casemapped CWCM Changes_When_Lowercased CWL
        Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph
        GrBase Hex XDigit Hex_Digit ID_Continue IDC ID_Start IDS Letter L_
        Latin Latn Uppercase_Letter PerlWord PosixAlnum PosixAlpha PosixGraph
        PosixPrint PosixUpper Print Upper Uppercase Word XID_Continue XIDC
        XID_Start XIDS
    (C) Age:1.1 Block=Basic_Latin Bidi_Class:L Bidi_Class=Left_To_Right
        Bidi_Class:Left_To_Right Bc=L Block:ASCII Block:Basic_Latin Blk=ASCII
        Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered
        Canonical_Combining_Class:Not_Reordered Ccc=NR
        Canonical_Combining_Class:NR Decomposition_Type:None Dt=None
        East_Asian_Width:Na East_Asian_Width=Narrow East_Asian_Width:Narrow
        Ea=Na Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX
        Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
        Hangul_Syllable_Type=Not_Applicable
        Hangul_Syllable_Type:Not_Applicable Hst=NA
        Joining_Group:No_Joining_Group Jg=NoJoiningGroup
        Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining
        Script=Latin Line_Break:AL Line_Break=Alphabetic Line_Break:Alphabetic
        Lb=AL Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN
        Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1
        In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2
        In=3.2 Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0
        In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Latin
        Sc=Latn Script:Latn Sentence_Break:UP Sentence_Break=Upper
        Sentence_Break:Upper SB=UP Word_Break:ALetter WB=LE Word_Break:LE
        Word_Break=ALetter

What that means is that the "B" properties are properties from the
*General* category. They may all be referred to as \p{X} or \p{IsX},
\p{General_Category=X} or \p{General_Category:X}, and \p{GC=X} or
\p{GC:X}.

I have a feeling that your Synopsis quote is referring only to type B
properties. It is not talking about type C properties, which must also
be accounted for.

--tom
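
For anyone who wants to poke at the single-name versus name=value distinction without a Perl build at hand, here is a rough sketch in Python (not Perl 5 or Perl 6) using only the standard `unicodedata` module. It looks up the General Category of U+0041 alongside three of the "C"-style properties from the listing above. The helper name `describe` and the dictionary keys are my own labels for illustration, and `unicodedata` exposes only a small subset of what `uniprops` reports, so treat this as a cross-check and not as a model of how any regex engine should spell these.

```python
import unicodedata


def describe(ch):
    """Collect a few Unicode properties for one character.

    The keys are illustrative labels chosen to match the property
    names in the uniprops listing; they are not part of any API.
    """
    return {
        # The General Category is what \p{Lu} / <isLu> tests.
        "General_Category": unicodedata.category(ch),
        # The rest are name=value properties, like group (C) in the listing.
        "Bidi_Class": unicodedata.bidirectional(ch),
        "Canonical_Combining_Class": unicodedata.combining(ch),
        "East_Asian_Width": unicodedata.east_asian_width(ch),
    }


props = describe("A")
for name, value in props.items():
    print(f"{name}={value}")
```

For "A" this reports General_Category=Lu, Bidi_Class=L, Canonical_Combining_Class=0 and East_Asian_Width=Na, which lines up with the corresponding entries in the uniprops output. The point is only that a property such as East_Asian_Width needs both a name and a value, so an "is" prefix plus a single name does not obviously cover it.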