Re: Perl6 regexes and UTS#18

Helmut Wollmersdorfer Sun, 06 Feb 2011 12:00:10 -0800

Tom Christiansen wrote:
> Has anybody specifically looked at how Perl6 regexes might map to
> the various requirements of UTS#18, Unicode Regular Expressions?
>
>     http://unicode.org/reports/tr18/


Roughly, Perl6 supports Level 2 (I did a quick check of UTS#18 against the specs).

> I ask because to my inexperienced eye, quite a few perl6isms are
> *much* better at this than in perl5 obtain, and so I wondered
> whether this was by conscious intent and design.  Is/Was it?

Seems intended.

> I'm also curious whether there are active plans to address the
> tr18 requirements in perl6 regexes.  It would be a wonderful
> feather in perl6's cap to be able to legitimately claim Level 2
> or even Level 3 compliance, since besides perl5, only ICU right
> now manages even Level 1, with everybody else *very* far behind.

I would like to have:
- all Unicode features of Perl5 (UCD, charnames, Normalize, properties)
- most of the features of ICU (e.g. transforms, localisation)
- normalization form, locale support and tailored $features at string
  level (_not_ lexical context)

This means that any string can be in, or can be transformed into, the form
- Byte
- NFD, NFC, NFKD, NFKC
- NFG (Default Grapheme Clusters)
- NFGT (Tailored Grapheme Cluster)

Tailored means that the grapheme (NFG) needs a language locale (e.g. German-Swiss-Spelling_1996, or German-Austrian-PhoneBookCollation). Without a language locale, an NFG string is handled as a default NFG string.

Tailored also means that the user (the Perl6 programmer) can tailor the relevant mechanisms (formatting, normalization, collation, properties, case folding etc.).
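To make the list of forms above concrete, here is a minimal sketch in Python (not Perl6), using only the standard `unicodedata` module. The sample string and the helper name `forms` are my own assumptions for illustration. NFG and NFGT are not covered, since Python's standard library has no grapheme-level string form.

```python
import unicodedata

# Sample string (an assumption for illustration): a precomposed
# "e with acute" (U+00E9) followed by the "fi" ligature (U+FB01),
# which has a compatibility mapping.
sample = "\u00e9\ufb01"

def forms(text):
    """Return the text in each of the four Unicode normalization forms."""
    return {name: unicodedata.normalize(name, text)
            for name in ("NFD", "NFC", "NFKD", "NFKC")}

result = forms(sample)

# The "Byte" form is taken here to mean an encoded byte string (UTF-8 assumed).
as_bytes = result["NFC"].encode("utf-8")

for name, value in result.items():
    # Print the code points so the differences between forms are visible.
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```

Note that the form here is chosen per call on each string, which is close to the string-level (not lexical) handling asked for above.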


> TR18 specifies three levels of support (Basic, Extended, and Tailored),
> with each having specific, reasonably well-defined requirements:

There is a lot of work in the Unicode standard. Using it costs nothing, but saves time. E.g. the allowed characters for identifiers can be defined with the Unicode properties 'ID_Start' and 'ID_Continue', and graphemes with Grapheme_Base, Grapheme_Extend etc.
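As an illustration of reusing Unicode's identifier properties, here is a small Python sketch. It is only an analogy: Python's `str.isidentifier()` follows Python's own grammar, which is built on the closely related XID_Start and XID_Continue properties (plus "_" as a start character), not on ID_Start and ID_Continue directly. The helper name and the candidate strings are assumptions.

```python
def looks_like_identifier(name):
    """Check a name against Python's identifier rules.

    These rules are derived from the Unicode properties XID_Start and
    XID_Continue, so no hand-written character list is needed.
    """
    return name.isidentifier()

# Hypothetical candidates: a non-ASCII letter, a leading digit,
# a base letter plus combining mark, and a hyphenated name.
candidates = ["stra\u00dfe", "1abc", "a\u0301", "foo-bar"]
checks = {c: looks_like_identifier(c) for c in candidates}

for name, ok in checks.items():
    print(repr(name), ok)
```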


>   =Level 1: Basic Unicode Support
[...]
>    RL1.3    Subtraction and Intersection

IMHO not complete

>    RL1.5    Simple Loose Matches

Hmm ...

>    RL1.6    Line Boundaries

can be defined

>    RL1.7    Supplementary Code Points

IMHO not specced

>   =Level 2: Extended Unicode Support
>    RL2.1    Canonical Equivalents

IMHO not specced
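To show what canonical equivalence means in practice, here is a hedged Python sketch (again not Perl6). It uses a common workaround, normalizing both pattern and text to NFC before matching, which only approximates RL2.1 and is safe only for patterns whose metacharacters are unaffected by normalization. The helper name `nfc_search` is an assumption.

```python
import re
import unicodedata

def nfc_search(pattern, text):
    """Search after bringing both pattern and text into NFC.

    This approximates canonical-equivalent matching; it is not a full
    implementation of RL2.1.
    """
    return re.search(unicodedata.normalize("NFC", pattern),
                     unicodedata.normalize("NFC", text))

precomposed = "caf\u00e9"    # ends in U+00E9
decomposed = "cafe\u0301"    # ends in "e" + combining acute U+0301

# Without normalization the canonically equivalent strings do not match.
raw_match = re.search(precomposed, decomposed)

# With both sides in NFC they do.
normalized_match = nfc_search(precomposed, decomposed)

print(raw_match is None, normalized_match is not None)
```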

>    RL2.4    Default Loose Matches
>    RL2.5    Name Properties
>    RL2.6    Wildcard Properties


IMHO not specced

>   =Level 3: Tailored Unicode Support

IMHO not specced

It would be easier to reference the appropriate chapters of the Unicode standard in the specification of Perl6. This would make Unicode test cases reusable. And an implementation should always declare which features of Unicode are implemented (and which are not), and for which version of Unicode.


Helmut Wollmersdorfer
