Tom Christiansen wrote:
> Has anybody specifically looked at how Perl6 regexes might map to
> the various requirements of UTS#18, Unicode Regular Expressions?
>
> http://unicode.org/reports/tr18/
Roughly, Perl6 supports Level 2 (I did a quick check of UTS#18 against the specs).
> I ask because to my inexperienced eye, quite a few perl6isms are
> *much* better at this than in perl5 obtain, and so I wondered
> whether this was by conscious intent and design. Is/Was it?
Seems intended.
> I'm also curious whether there are active plans to address the
> tr18 requirements in perl6 regexes. It would be a wonderful
> feather in perl6's cap to be able to legitimately claim Level 2
> or even Level 3 compliance, since besides perl5, only ICU right
> now manages even Level 1, with everybody else *very* far behind.
I would like to have:
- all Unicode features of Perl5 (UCD, charnames, Normalize, properties)
- most of the features of ICU (e.g. transforms, localisation)
- normalization form, locale support and tailored $features at the string
level (_not_ lexical context)
This means that any string can be in, or can be transformed into, one of
these forms:
- Byte
- NFD, NFC, NFKD, NFKC
- NFG (Default Grapheme Clusters)
- NFGT (Tailored Grapheme Cluster)
Tailored means that the grapheme form (NFG) needs a language locale (e.g.
German-Swiss-Spelling_1996 or German-Austrian-PhoneBookCollation).
Without a language locale, an NFG string is handled as a default NFG string.
Tailored also means that the user (the Perl6 programmer) can tailor the
relevant mechanisms (formatting, normalization, collation, properties,
case folding etc.).
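To illustrate the forms listed above, here is a minimal sketch in Python (not Perl6; its standard unicodedata module serves only as a stand-in). It covers the byte form and the four normalization forms. NFG and NFGT have no counterpart in that module and are left out. The helper name forms() is hypothetical.

```python
import unicodedata

def forms(s):
    """Return the string in the byte form and the four Unicode normalization forms."""
    result = {"Byte": s.encode("utf-8")}  # assumption: UTF-8 as the byte encoding
    for name in ("NFD", "NFC", "NFKD", "NFKC"):
        result[name] = unicodedata.normalize(name, s)
    return result

# U+00E9 (e with acute) followed by U+FB01 (fi ligature)
f = forms("\u00e9\ufb01")
for name, value in f.items():
    print(name, len(value), repr(value))
```

The lengths differ per form because canonical decomposition splits the accented letter, and only the compatibility forms (NFKD, NFKC) replace the ligature with the two letters "fi".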
> TR18 specifies three levels of support (Basic, Extended, and Tailored),
> with each having specific, reasonably well-defined requirements:
There is a lot of work in the Unicode standard; using it costs nothing
and saves time. E.g. the allowed characters for identifiers can be
defined with the Unicode properties 'ID_Start' and 'ID_Continue', and
graphemes with Grapheme_Base, Grapheme_Extend etc.
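As a rough sketch of reusing Unicode properties this way, again in Python rather than Perl6: str.isidentifier() follows Python's own rule (XID_Start plus "_", then XID_Continue), which is close to but not identical with ID_Start / ID_Continue. The unicodedata module does not expose Grapheme_Extend, so the second helper assumes the general categories Mn and Me as an approximation. Both helper names are hypothetical.

```python
import unicodedata

def is_identifier(s):
    # Python's rule: XID_Start (plus "_") followed by XID_Continue*,
    # used here as a stand-in for ID_Start / ID_Continue.
    return s.isidentifier()

def rough_clusters(s):
    """Group each base character with the marks that follow it.

    Assumption: categories Mn and Me stand in for Grapheme_Extend.
    This is NOT the full default grapheme cluster algorithm.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

print(is_identifier("gr\u00f6\u00dfe"), is_identifier("2fast"))
print(rough_clusters("e\u0301a"))
```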
> =Level 1: Basic Unicode Support
[...]
> RL1.3 Subtraction and Intersection
IMHO not complete
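For comparison, an engine without set operators can emulate subtraction and intersection with lookahead. A minimal sketch in Python's re module (not Perl6 syntax; the pattern names are only for illustration):

```python
import re

# [a-z] minus [aeiou]: the negative lookahead rejects the subtracted set
consonant = re.compile(r"(?![aeiou])[a-z]")

# [a-m] intersected with [h-z]: the positive lookahead requires the second set
h_to_m = re.compile(r"(?=[h-z])[a-m]")

print(consonant.findall("unicode"))
print(h_to_m.findall("algorithm"))
```

This works, but it is clumsy compared with real set operators on character classes, which is what RL1.3 asks for.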
> RL1.5 Simple Loose Matches
Hmm ...
> RL1.6 Line Boundaries
can be defined
> RL1.7 Supplementary Code Points
IMHO not specced
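What RL1.7 asks for can be sketched in Python 3 (a stand-in, not Perl6), where strings are sequences of code points, so a character outside the BMP counts and matches as one unit:

```python
import re

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

code_points = len(clef)
utf16_units = len(clef.encode("utf-16-le")) // 2  # two bytes per code unit

# "." and a character class both treat the supplementary character as one unit
dot_matches = re.fullmatch(r".", clef) is not None
range_matches = re.fullmatch(r"[\U00010000-\U0010FFFF]", clef) is not None

print(code_points, utf16_units, dot_matches, range_matches)
```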
> =Level 2: Extended Unicode Support
> RL2.1 Canonical Equivalents
IMHO not specced
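One simple way to approximate RL2.1 is to normalize both the pattern and the text to the same form before matching. A minimal sketch in Python (not Perl6), restricted to literal patterns; the helper names nfd() and search_canonical() are hypothetical:

```python
import re
import unicodedata

def nfd(s):
    return unicodedata.normalize("NFD", s)

def search_canonical(literal, text):
    """Find a literal string, treating canonically equivalent forms as equal.

    Works only for literal patterns; a full regex would also need its
    character classes and quantifiers handled.
    """
    return re.search(re.escape(nfd(literal)), nfd(text))

precomposed = "caf\u00e9"   # e with acute as one code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

print(re.search("\u00e9", decomposed) is not None)        # plain match
print(search_canonical("\u00e9", decomposed) is not None)  # normalized match
```

Note that the match offsets refer to the normalized text, not to the original string.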
> RL2.4 Default Loose Matches
> RL2.5 Name Properties
> RL2.6 Wildcard Properties
IMHO not specced
> =Level 3: Tailored Unicode Support
IMHO not specced
It would be easier to reference the appropriate chapters of the Unicode
standard in the Perl6 specification. This would make Unicode test cases
reusable. An implementation should also always declare which Unicode
features it implements (and which it does not), and for which version of
Unicode.
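As an example of such a declaration, Python (used here only for comparison) reports the version of the Unicode Character Database it ships with:

```python
import sys
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter
ucd_version = unicodedata.unidata_version
print("Unicode", ucd_version, "on Python", sys.version.split()[0])
```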
Helmut Wollmersdorfer