Re: The .bytes/.codepoints/.graphemes methods

Larry Wall Wed, 07 Jul 2004 20:10:08 -0700

On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote:
: On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote:
: > This has no direct bearing on p6l, since performance is a p6i issue.
: > But perhaps in the interests of performance as well as hackery we
: > should explicitly provide some sort of variant regex behavior:
: > 
: >     /a./ :bytes
: >     /a./ :graphemes
: > 
: > where the first would recognize 0x61 followed by any single byte, while
: > the second would recognize 'a' followed by any number of bytes
: > composing a single grapheme.
: 
: Isn't that what :u0, :u1, :u2, and :u3 are for?
: 
:           :u0         # use bytes       (. is byte)
:           :u1         # level 1 support (. is codepoint)
:           :u2         # level 2 support (. is grapheme)
:           :u3         # level 3 support (. is language dependent)
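
[Illustration, not part of the original message.] The first three levels can
be pictured as three ways of counting the same string. The Python sketch below
assumes UTF-8 for the byte level and approximates a grapheme as a base
codepoint plus any following combining marks, which is simpler than the full
Unicode grapheme cluster rules. The helper name `graphemes` is hypothetical.

```python
import unicodedata

def graphemes(text):
    """Approximate graphemes: a base codepoint plus trailing combining marks.

    Assumption: general category M* is a good enough test for "combining";
    the real Unicode segmentation rules cover more cases.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

s = "a\u0301b"  # 'a', COMBINING ACUTE ACCENT, 'b'

units = {
    "bytes": len(s.encode("utf-8")),   # roughly what '.' steps over at :u0
    "codepoints": len(s),              # roughly :u1
    "graphemes": len(graphemes(s)),    # roughly :u2
}
```

Here a `.` would have four, three, or two positions to step over, depending
on the level in force.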


These modifiers might get renamed to match whatever b/c/g/w convention
we come up with for pragmas.  The levels aren't all that intuitive, though
there is a kind of progression of semantic complexity that would get
lost with ordinary names.

:         These modifiers say nothing about the state of the data, but in
:         general internal Perl data will already be in Normalization Form
:         C, so even under :u1, the precomposed characters will usually do
:         the right thing.

These days it might be that most of the data we see will be maximally
decomposed rather than maximally composed.  But the jury is still out
on that.  And in any event, :u2 and :u3 should hide that distinction.
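
[Illustration, not part of the original message.] A small Python sketch of
why the grapheme level hides the composed/decomposed distinction: the
codepoint count changes between Normalization Forms C and D, while a
grapheme count does not. `grapheme_count` is a hypothetical helper that
simply skips combining marks, and it assumes the text does not begin with one.

```python
import unicodedata

def grapheme_count(text):
    """Count codepoints that are not combining marks (category M*).

    Assumption: the text starts with a base character, so every mark
    belongs to some preceding base.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

composed = unicodedata.normalize("NFC", "re\u0301sume\u0301")  # precomposed accents
decomposed = unicodedata.normalize("NFD", composed)            # base + combining mark

codepoint_counts = (len(composed), len(decomposed))
grapheme_counts = (grapheme_count(composed), grapheme_count(decomposed))
```

At the codepoint level the two forms differ in length (6 versus 8), so a
pattern written with `.` could behave differently on each; at the grapheme
level both count as 6.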

:         Note that these modifiers are for overriding
:         the default support level, which was probably set by pragma at
:         the top of the file.

Another way of saying that is that these modifiers are, in fact,
lexically scoped pragmas with the *exact* same effect as the ordinary
Unicode level pragmas.  It's just that they're lexically scoped to
the rest of a rule or group rather than to the rest of a block.

: Or was that to imply that a literal "a" in the RE would be
: interpreted as a "grapheme a" when :u2 is active?

I don't know what you mean by "grapheme a" there.  If you mean, "Does
it match any grapheme that happens to be exactly U+0061?", then the
answer is yes.  If you mean "Does it wildcard to any grapheme that uses
U+0061 as the base character?", then the answer is probably no.  We
have not yet come up with a syntax for that kind of wildcarding, other
than dropping down to codepoints [:u1 a \pM+] or some such.  That may
or may not be sufficient.  It'd be pretty easy to define a <like a>
assertion in any case.
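
[Illustration, not part of the original message.] The codepoint-level
fallback [:u1 a \pM+] can be approximated in Python by decomposing a
candidate grapheme and checking for the base character followed by one or
more combining marks. The function name `base_plus_marks` is hypothetical,
and whether a real <like a> assertion would also accept a bare "a" is left
open here; this sketch follows the `+` and requires at least one mark.

```python
import unicodedata

def is_mark(ch):
    """True for codepoints in a combining-mark category (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def base_plus_marks(grapheme, base="a"):
    """True if `grapheme` decomposes to `base` followed by one or more marks."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    return (
        len(decomposed) >= 2
        and decomposed[0] == base
        and all(is_mark(ch) for ch in decomposed[1:])
    )

results = {
    "precomposed": base_plus_marks("\u00e1"),       # a with acute, one codepoint
    "stacked": base_plus_marks("a\u0308\u0301"),    # a + diaeresis + acute
    "bare": base_plus_marks("a"),                   # no marks at all
    "other_base": base_plus_marks("e\u0301"),       # wrong base character
}
```

Decomposing first means the check treats precomposed and decomposed input
alike, which an exact U+0061 match would not.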

Larry

Re: The .bytes/.codepoints/.graphemes methods
