On Sun, Feb 06, 2011 at 08:59:51PM +0100, Helmut Wollmersdorfer wrote:
: Tom Christiansen wrote:
: > Has anybody specifically looked at how Perl6 regexes might map to
: > the various requirements of UTS#18, Unicode Regular Expressions?
: >
: > http://unicode.org/reports/tr18/
:
: Roughly Perl6 supports Level 2 (I did a fast check of UTS#18 and the specs).

I believe the spec now supports all of Level 2 explicitly, more or less.

: > I ask because to my inexperienced eye, quite a few perl6isms are
: > *much* better at this than in perl5 obtain, and so I wondered
: > whether this was by conscious intent and design. Is/Was it?
:
: Seems intended.

Well, some of it is convergent evolution (or divergent, in some cases),
but we have paid a certain amount of attention to what the unicode
folks have to say over the years.

: > I'm also curious whether there are active plans to address the
: > tr18 requirements in perl6 regexes. It would be a wonderful
: > feather in perl6's cap to be able to legitimately claim Level 2
: > or even Level 3 compliance, since besides perl5, only ICU right
: > now manages even Level 1, with everybody else *very* far behind.

Anyone who implements all of S05 can claim Level 2.

: I would like to have:
: - all Unicode features of Perl5 (UCD, charnames, Normalize, properties)
: - most of the features of ICU (e.g. transforms, localisation)
: - normalization form, local-support and tailored $features on string
:   level (_not_ lexical context)
:
: This means that any string can be in or can be transformed into the form
: - Byte
: - NFD, NFC, NFKD, NFKC
: - NFG (Default Grapheme Clusters)
: - NFGT (Tailored Grapheme Cluster)
:
: Tailored means, that the Grapheme (NFG) needs a Language-Local (e.g.
: German-Swiss-Spelling_1996, or German-Austrian-PhoneBookCollation).
: Without a Language-Local a NFG-string is handled as NFG-string
: (default).
:
: Tailored also means, that the user (Perl6 programmer) can tailor the
: relevant mechanisms (formatting, normalization, collation,
: properties, case folding etc.).

I think we're mostly on the same page, though I think we'll need to
thrash out how much of this info is carried in the string type, and
how much is implicit to the current language's view of strings (where
the programmer gets to choose which viewpoint to take, such as
"graphemes", "codepoints", "bytes", etc.).

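To make the "viewpoints" idea concrete, here is a minimal sketch in Python (not Perl 6, and not anything taken from the spec) that counts the same string as bytes, as codepoints, and as approximate graphemes. The helper name `views` is hypothetical, UTF-8 is an assumed encoding for the byte view, and the grapheme count is only a rough stand-in that treats every codepoint with a nonzero canonical combining class as part of the preceding cluster. It does not implement the full default grapheme cluster rules, let alone tailored ones.

```python
import unicodedata


def views(s):
    """Return (byte_count, codepoint_count, approx_grapheme_count) for s.

    Assumptions: bytes are counted in UTF-8, and a "grapheme" is
    approximated as a codepoint with canonical combining class 0 plus any
    combining marks that follow it. This is a simplification, not the
    Unicode default grapheme cluster algorithm.
    """
    byte_count = len(s.encode("utf-8"))
    codepoint_count = len(s)
    grapheme_count = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return byte_count, codepoint_count, grapheme_count


# 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
decomposed = "e\u0301"
# The canonically equivalent precomposed form, U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(views(decomposed))  # (3, 2, 1)
print(views(composed))    # (2, 1, 1)
```

The two strings differ in bytes and codepoints but agree in the grapheme view, which is the sense in which a grapheme-oriented default makes the lower-level forms look more like buffers.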
We want to give access to strings from other viewpoints to the extent
that we can, but the current viewpoint will tend to produce different
default results than other viewpoints. By default, for instance,
Perl 6 is supposed to be warped toward NFG semantics, and wants to view
all strings through that lens unless otherwise instructed. The "lesser"
strings will tend to look more like buffers to such a view.

: > TR18 specifies three levels of support (Basic, Extended, and Tailored),
: > with each having specific, reasonably well-defined requirements:
:
: There is a lot of work in the UNICODE standard - using it costs
: nothing, but saves time. E.g. the allowed characters for identifiers
: can be defined with the Unicode properties 'ID_Start' and
: 'ID_Continue', Grapheme with Grapheme_Base, Grapheme_Extend etc.
:
: > =Level 1: Basic Unicode Support
: [...]
: > RL1.3 Subtraction and Intersection
:
: IMHO not complete

Fixed. Now have |, &, and ^ with expected precedence, and () for
bracketing.

: > RL1.5 Simple Loose Matches
:
: Hmm ...

Clarified.

: > RL1.6 Line Boundaries
:
: can be defined

Now we have an extensible boundary syntax with a different namespace
than normal rules. <|w> is a word boundary, etc.

: > RL1.7 Supplementary Code Points
:
: IMHO not specced

Yes, already implicit to the NFG view of reality.

: > =Level 2: Extended Unicode Support
: > RL2.1 Canonical Equivalents
:
: IMHO not specced

To me, this is also implied by NFG semantics.

: > RL2.4 Default Loose Matches
: > RL2.5 Name Properties
: > RL2.6 Wildcard Properties
:
: IMHO not specced

Was already specced as a "smartmatch", but I've made the name matching
more explicit.

: > =Level 3: Tailored Unicode Support
:
: IMHO not specced

We've had :chars conjecturing tailoring for quite a while now.
Of course, the actual syntax is negotiable.

: It would be easier to reference the appropriate chapters of the
: Unicode standard in the specification of Perl6. This would make
: Unicode test-cases reusable.
: And an implementation should always
: declare, which features of Unicode are implemented (and which not)
: in which version of Unicode.

We can certainly use a lot more work on the details, but the intent
should be clear that we want Perl 6 to support world-class Unicode.

Larry