Larry Wall wrote:
On Sun, Feb 06, 2011 at 08:59:51PM +0100, Helmut Wollmersdorfer wrote:
: Tom Christiansen wrote:
: > I'm also curious whether there are active plans to address the
: > tr18 requirements in perl6 regexes. It would be a wonderful
: > feather in perl6's cap to be able to legitimately claim Level 2
: > or even Level 3 compliance, since besides perl5, only ICU right
: > now manages even Level 1, with everybody else *very* far behind.
Anyone who implements all of S05 can claim Level 2.
I will remember this day as a big milestone.
: I would like to have:
: - all Unicode features of Perl5 (UCD, charnames, Normalize, properties)
: - most of the features of ICU (e.g. transforms, localisation)
: - normalization form, locale support and tailored $features on string
: level (_not_ lexical context)
:
: This means that any string can be in, or can be transformed into, one
: of the forms
: - Byte
: - NFD, NFC, NFKD, NFKC
: - NFG (Default Grapheme Clusters)
: - NFGT (Tailored Grapheme Cluster)
:
: Tailored means that the Grapheme (NFG) needs a Language-Locale (e.g.
: German-Swiss-Spelling_1996, or German-Austrian-PhoneBookCollation).
: Without a Language-Locale, an NFG-string is handled as a default
: NFG-string.
:
: Tailored also means that the user (Perl6 programmer) can tailor the
: relevant mechanisms (formatting, normalization, collation,
: properties, case folding etc.).
I think we're mostly on the same page, though I think we'll need
to thrash out how much of this info is carried in the string type,
and how much is implicit to the current language's view of strings
(where the programmer gets to choose which viewpoint to take, such as
"graphemes", "codepoints", "bytes", etc.). We want to give access to
strings from other viewpoints to the extent that we can, but the current
viewpoint will tend to produce different default results than other
viewpoints. By default, for instance, Perl 6 is supposed to be warped
toward NFG semantics, and wants to view all strings through that lens
unless otherwise instructed. The "lesser" strings will tend to look
more like buffers to such a view.
I agree. NFG is the most convenient mode as a general view of strings.
And it conforms to Level 2, where Unicode guarantees stability.
Sure, if one wants to work at the "codepoint" (NFC, NFD) level, one must
be aware that "2.1 Canonical Equivalents" is not fulfilled.
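To make the difference between the viewpoints concrete, here is a
minimal sketch in Python (not Perl 6), using only the standard
unicodedata module. The helper name canonically_equal is made up for
this illustration. It shows that at the codepoint and byte levels two
canonically equivalent spellings of the same character differ, and that
they only compare equal after normalization.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Code-point view: the two spellings have different lengths and are unequal.
print(len(composed), len(decomposed))    # 1 2
print(composed == decomposed)            # False

# Byte view (UTF-8 encoding): different again.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3


def canonically_equal(a, b):
    """Hypothetical helper: compare two strings under canonical equivalence
    by bringing both to the same normalization form first."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(canonically_equal(composed, decomposed))   # True
```

A view that always normalizes before comparing never sees the
difference, which is the property RL2.1 asks for.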
: > =Level 2: Extended Unicode Support
: > RL2.1 Canonical Equivalents
: IMHO not specced
To me, this is also implied by NFG semantics.
I agree. After a longish discussion with Tom, who presented dozens of
corner cases to me, we now understand better that it works with all
the normalization rules, and why.
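As a rough illustration of why canonical equivalence falls out of NFG,
here is a sketch in Python. It is not how any Perl 6 implementation
does it: the class NFGTable and the function to_nfg are invented names,
and the clustering rule (a base character plus any following characters
with a non-zero combining class) is a deliberate simplification of the
default grapheme cluster rules. The idea shown is only this: normalize
to NFC, group into clusters, and hand out a synthetic negative id for
every cluster that has no single codepoint.

```python
import unicodedata


class NFGTable:
    """Hypothetical registry that assigns synthetic ids at runtime."""

    def __init__(self):
        self._ids = {}        # cluster string -> synthetic id
        self._clusters = []   # index (-id - 1) -> cluster string

    def intern(self, cluster):
        # A cluster that is a single code point keeps its own number.
        if len(cluster) == 1:
            return ord(cluster)
        # Otherwise assign (or reuse) a negative synthetic id.
        if cluster not in self._ids:
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]


def to_nfg(text, table):
    """Simplified NFG: NFC first, then base + combining marks per cluster."""
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return [table.intern(c) for c in clusters]


table = NFGTable()
# Same marks in a different order: canonically equivalent input.
a = to_nfg("q\u0307\u0323", table)
b = to_nfg("q\u0323\u0307", table)
print(a == b, len(a))
```

Because both inputs are normalized before ids are assigned, equivalent
strings end up as the same sequence of ids, and comparison or matching
on that sequence respects canonical equivalence without extra work.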
: > RL2.4 Default Loose Matches
: > RL2.5 Name Properties
: > RL2.6 Wildcard Properties
:
: IMHO not specced
Was already specced as a "smartmatch", but I've made the name matching more
explicit.
Thx
: > =Level 3: Tailored Unicode Support
:
: IMHO not specced
We've had :chars conjecturing tailoring for quite a while now. Of course,
the actual syntax is negotiable.
: It would be easier to reference the appropriate chapters of the
: Unicode standard in the specification of Perl6. This would make
: Unicode test-cases reusable. And an implementation should always
: declare, which features of Unicode are implemented (and which not)
: in which version of Unicode.
We can certainly use a lot more work on the details, but the intent
should be clear that we want Perl 6 to support world-class Unicode.
Agreed.
Tailoring, as a kind of overloading, intentionally breaks the standard
behaviour. But an implementation of NFG must have a mechanism to assign
codepoints and provide properties of NFG-codepoints at runtime. The same
mechanism can be opened for tailoring, and (we should not forget them)
for characters in the private range. Of course, overloading existing
properties like case folding of the Turkish 'i' is not the same as just
adding NFG- or private properties.
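To show what overloading an existing property means for the Turkish
'i', here is a small sketch in Python. The functions turkish_lower and
turkish_upper are hypothetical names for this example only. Python's
built-in str.lower() and str.upper() apply the untailored default
mappings, so the tailored versions have to override the two affected
letter pairs before falling back to the default.

```python
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def turkish_lower(text):
    """Hypothetical tailoring: dotted capital I -> i, I -> dotless i,
    then the default mapping for everything else."""
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


def turkish_upper(text):
    """Hypothetical tailoring: i -> dotted capital I, dotless i -> I,
    then the default mapping for everything else."""
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()


print("ISTANBUL".lower())          # default mapping: istanbul
print(turkish_lower("ISTANBUL"))   # tailored: starts with dotless i
print(turkish_upper("istanbul"))   # tailored: starts with dotted capital I
```

The tailored result differs from the default for characters that
already have a standard mapping, which is why this is a stronger kind
of change than adding properties for new NFG or private-use codepoints.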
Helmut Wollmersdorfer