Re: Perl6 regexes and UTS#18

Larry Wall Wed, 09 Feb 2011 12:53:25 -0800

On Sun, Feb 06, 2011 at 08:59:51PM +0100, Helmut Wollmersdorfer wrote:
: Tom Christiansen wrote:
: > Has anybody specifically looked at how Perl6 regexes might map to
: > the various requirements of UTS#18, Unicode Regular Expressions?
: >
: >     http://unicode.org/reports/tr18/
: 
: Roughly Perl6 supports Level 2 (I did a fast check of UTS#18 and the specs).


I believe the spec now supports all of Level 2 explicitly, more or less.

: > I ask because to my inexperienced eye, quite a few perl6isms are
: > *much* better at this than in perl5 obtain, and so I wondered
: > whether this was by conscious intent and design.  Is/Was it?
: 
: Seems intended.

Well, some of it is convergent evolution (or divergent, in some cases),
but we have paid a certain amount of attention to what the unicode
folks have to say over the years.

: > I'm also curious whether there are active plans to address the
: > tr18 requirements in perl6 regexes.  It would be a wonderful
: > feather in perl6's cap to be able to legitimately claim Level 2
: > or even Level 3 compliance, since besides perl5, only ICU right
: > now manages even Level 1, with everybody else *very* far behind.

Anyone who implements all of S05 can claim Level 2.

: I would like to have:
: - all Unicode features of Perl5 (UCD, charnames, Normalize, properties)
: - most of the features of ICU (e.g. transforms, localisation)
: - normalization form, locale support and tailored $features on string
:   level (_not_ lexical context)
: 
: This means that any string can be in or can be transformed into the form
: - Byte
: - NFD, NFC, NFKD, NFKC
: - NFG (Default Grapheme Clusters)
: - NFGT (Tailored Grapheme Cluster)
: 
: Tailored means that the Grapheme (NFG) needs a Language-Locale (e.g.
: German-Swiss-Spelling_1996, or German-Austrian-PhoneBookCollation).
: Without a Language-Locale an NFG string is handled as a default
: NFG string.
: 
: Tailored also means, that the user (Perl6 programmer) can tailor the
: relevant mechanisms (formatting, normalization, collation,
: properties, case folding etc.).

I think we're mostly on the same page, though I think we'll need
to thrash out how much of this info is carried in the string type,
and how much is implicit to the current language's view of strings
(where the programmer gets to choose which viewpoint to take, such as
"graphemes", "codepoints", "bytes", etc.).  We want to give access to
strings from other viewpoints to the extent that we can, but the current
viewpoint will tend to produce different default results than other
viewpoints.  By default, for instance, Perl 6 is supposed to be warped
toward NFG semantics, and wants to view all strings through that lens
unless otherwise instructed.  The "lesser" strings will tend to look
more like buffers to such a view.
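
As a rough illustration of the "viewpoints" idea (in Python, not Perl 6,
and not taken from the spec), the sketch below measures one string as
bytes, as codepoints, and as approximate graphemes.  The helper name
views() is made up for this example, and counting non-combining
codepoints is only a crude stand-in for real grapheme cluster
segmentation, which has more rules than this.

```python
import unicodedata

def views(s):
    """Length of s under three viewpoints (hypothetical helper)."""
    return {
        # Byte view: length of the UTF-8 encoding.
        "bytes": len(s.encode("utf-8")),
        # Codepoint view: Python strings are sequences of code points.
        "codepoints": len(s),
        # Grapheme view, approximated: count code points whose canonical
        # combining class is 0, i.e. skip combining marks.
        "graphemes": sum(1 for ch in s if unicodedata.combining(ch) == 0),
    }

# 'e' + COMBINING ACUTE ACCENT, 'q' + COMBINING TILDE
s = "e\u0301q\u0303"
print(views(s))
```

The same four codepoints look like six bytes from one viewpoint and two
characters from another, which is why the default viewpoint matters.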

: > TR18 specifies three levels of support (Basic, Extended, and Tailored),
: > with each having specific, reasonably well-defined requirements:
: 
: There is a lot of work in the UNICODE standard - using it costs
: nothing, but saves time. E.g. the allowed characters for identifiers
: can be defined with the Unicode properties 'ID_Start' and
: 'ID_Continue', Grapheme with Grapheme_Base, Grapheme_Extend etc.
: 
: >   =Level 1: Basic Unicode Support
: [...]
: >    RL1.3    Subtraction and Intersection
: 
: IMHO not complete

Fixed.  Now have |, &, and ^ with expected precedence, and () for bracketing.
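
To make RL1.3 concrete without claiming anything about the S05 syntax
itself, here is a small Python sketch that treats character classes as
plain sets of characters and combines them with Python's own set
operators.  Reading ^ as symmetric difference is an assumption borrowed
from Python; the helper chars_with() and the 0x250 cutoff are invented
for the example.

```python
import unicodedata

def chars_with(pred, limit=0x250):
    """Set of characters below `limit` that satisfy `pred` (hypothetical helper)."""
    return {chr(cp) for cp in range(limit) if pred(chr(cp))}

letters = chars_with(lambda c: unicodedata.category(c).startswith("L"))
upper = chars_with(lambda c: unicodedata.category(c) == "Lu")
ascii_ = chars_with(lambda c: ord(c) < 0x80)

ascii_upper = upper & ascii_         # intersection: ASCII uppercase letters
non_upper_letters = letters - upper  # subtraction: letters that are not Lu
either = upper | ascii_              # union
one_not_both = upper ^ ascii_        # symmetric difference (assumed meaning)

print(len(ascii_upper))              # 26
```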

: >    RL1.5    Simple Loose Matches
: 
: Hmm ...

Clarified.

: >    RL1.6    Line Boundaries
: 
: can be defined

Now we have an extensible boundary syntax with a different namespace than
normal rules.  <|w> is a word boundary, etc.

: >    RL1.7    Supplementary Code Points
: 
: IMHO not specced

Yes, already implicit to the NFG view of reality.
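
For readers unfamiliar with RL1.7, the requirement is that a character
outside the Basic Multilingual Plane counts as one unit for matching,
not as two surrogates.  The Python sketch below only shows the
behaviour in a codepoint-based string model; it says nothing about how
a Perl 6 implementation stores its strings.

```python
import re

# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, a supplementary code point), 'b'
s = "a\U0001D11Eb"

codepoints = len(s)                            # one unit per code point
utf16_units = len(s.encode("utf-16-le")) // 2  # the clef needs a surrogate pair

# '.' consumes the whole supplementary character, not half of it.
m = re.fullmatch(r"a(.)b", s)

# A character class range can span the supplementary planes as well.
m2 = re.fullmatch("a[\U00010000-\U0010FFFF]b", s)

print(codepoints, utf16_units, m is not None, m2 is not None)
```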

: >   =Level 2: Extended Unicode Support
: >    RL2.1    Canonical Equivalents
: 
: IMHO not specced

To me, this is also implied by NFG semantics.
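
A minimal Python sketch of what canonical equivalence means in
practice: two strings that differ as codepoint sequences compare equal
once both are normalized to the same form.  The helper names are made
up, and normalize-then-search is a simplification for illustration
rather than a full treatment of RL2.1.

```python
import unicodedata

def canon_eq(a, b):
    """True if a and b are canonically equivalent (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def canon_contains(haystack, needle):
    """Substring test after normalizing both sides to NFC (simplified)."""
    nfc = lambda s: unicodedata.normalize("NFC", s)
    return nfc(needle) in nfc(haystack)

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT

print(composed == decomposed)                # False: different code points
print(canon_eq(composed, decomposed))        # True
print(canon_contains("cafe\u0301", "\u00e9"))  # True
```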

: >    RL2.4    Default Loose Matches
: >    RL2.5    Name Properties
: >    RL2.6    Wildcard Properties
: 
: IMHO not specced

Was already specced as a "smartmatch", but I've made the name matching
more explicit.
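
To show the flavour of RL2.5 and RL2.6, the Python sketch below looks
up a character by a loosely written name and selects characters whose
names match a pattern.  Both helpers are hypothetical, the loose
matching covers only case, underscores and extra whitespace, and the
scan is restricted to the Greek block to keep it short.  None of this
is the S05 mechanism.

```python
import re
import unicodedata

def loose_lookup(name):
    """Look up a character by name, ignoring case, underscores and
    repeated whitespace (hypothetical helper)."""
    key = re.sub(r"[\s_]+", " ", name.strip()).upper()
    return unicodedata.lookup(key)

def names_matching(pattern, start=0x0370, stop=0x0400):
    """Characters in [start, stop) whose Unicode name matches pattern."""
    rx = re.compile(pattern)
    return {chr(cp) for cp in range(start, stop)
            if rx.search(unicodedata.name(chr(cp), ""))}

alpha = loose_lookup("  greek_small letter   alpha ")
ab = names_matching(r"^GREEK SMALL LETTER (ALPHA|BETA)$")
print(alpha, sorted(ab))
```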

: >   =Level 3: Tailored Unicode Support
: 
: IMHO not specced

We've had :chars conjecturing tailoring for quite a while now.  Of course,
the actual syntax is negotiable.

: It would be easier to reference the appropriate chapters of the
: Unicode standard in the specification of Perl6. This would make
: Unicode test-cases reusable. And an implementation should always
: declare, which features of Unicode are implemented (and which not)
: in which version of Unicode.

We can certainly use a lot more work on the details, but the intent
should be clear that we want Perl 6 to support world-class Unicode.

Larry
