Re: Regex named-group and backreference syntax

Xueming Shen Wed, 02 Sep 2009 00:16:30 -0700

Hi Alan,

It would be an "ambiguity" (and thus confusing) only if we had \k<n>
and $<n> as legally supported group reference syntax :-) That said, I
have to admit that it does not add any value to allow a group name to
begin with a digit character. So if we have a consensus I would be
happy to change the spec/implementation to disallow group names that
start with a digit.

I kinda disagree that the "rest of the named-group syntax" is copied
from .NET. Actually it is the syntax from Perl 5.10.0 (named capture
buffers), in which the naming syntax is (?<NAME>...) and the
backreference is \k<NAME>. I did not find a "reference to a named
capture buffer in the replacement" there. I did consider using the
.NET syntax, but finally decided to go with $<name> because it is more
consistent with the (?<name>...) and \k<name> syntax.
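
For illustration, the syntax described above can be sketched in Java
roughly as follows. This assumes the java.util.regex API as it later
shipped (Java 7 and newer), where the replacement-string form is
${name} rather than the $<name> under discussion here; the class name,
method names and sample strings are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK.
final class NamedGroupDemo {
    // (?<word>...) names the group; \k<word> refers back to it inside the regex.
    static final Pattern DOUBLED =
            Pattern.compile("\\b(?<word>\\w+) \\k<word>\\b");

    // Collapse a doubled word by referencing the named group in the replacement.
    // Assumption: ${word} is the replacement syntax of the released API.
    static String collapse(String input) {
        return DOUBLED.matcher(input).replaceAll("${word}");
    }

    // Read the named group directly from the Matcher, or null if nothing matches.
    static String firstDoubled(String input) {
        Matcher m = DOUBLED.matcher(input);
        return m.find() ? m.group("word") : null;
    }

    public static void main(String[] args) {
        System.out.println(collapse("hello hello world"));
        System.out.println(firstDoubled("it is is fine"));
    }
}
```

The same group is reachable three ways here: by name in the regex, by
name in the replacement, and by name through Matcher.group(String).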

Allowing \k<n> and $<n> is a fine idea; it at least looks less
"complicated" in the replacement case.

Sherman

Alan Moore wrote:

Looking at the new named-capture feature, two things jump out at me.
The first is that the rules governing group names make "0", "1", "2",
etc. valid names.  That's bound to cause confusion, as programmers use
\k<1> in the regex, or $<1> in the replacement string, meaning them as
ordinal backreferences.  It will be even worse if they actually have a
group named "1", which may or may not be the first (numbered) group.

Does this ambiguity add any value to offset the potential confusion?
Because it seems to me we could add even more value by disallowing
names that start with digits.  We could still allow \k<1> and $<1> and
such as backreferences, but they would be aliases for \1 and $1
respectively.  The advantage is that a backreference in one of those
forms could be followed by another digit and there would be no danger
of forming a different capture-group reference.

For example, $10 could mean group(1) followed by zero, or group(10) if
the regex has that many groups.  If it's group(1) you want, you can
escape the zero with a backslash to make that clear.  But what if you
really mean group(10) but there's no such group?  You won't be
notified of your error, because the Matcher assumes you meant group(1)
plus "0".  But with \k<1> and $<1> there's no ambiguity and no need to
escape anything.
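
The $10 case above can be reproduced with a small Java sketch. It
assumes the existing numbered-reference behaviour of
Matcher.replaceAll; the class name, helper method and patterns are
hypothetical and chosen only to show the two readings.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the JDK.
final class DollarTenDemo {
    static final Pattern ONE_GROUP = Pattern.compile("(a)");
    static final Pattern TEN_GROUPS =
            Pattern.compile("(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)");

    // Thin wrapper so each case below reads the same way.
    static String replace(Pattern p, String input, String replacement) {
        return p.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Only one group exists, so $10 is silently read as group(1) + "0".
        System.out.println(replace(ONE_GROUP, "a", "$10"));
        // Ten groups exist, so the same replacement now means group(10).
        System.out.println(replace(TEN_GROUPS, "abcdefghij", "$10"));
        // Escaping the zero forces group(1) followed by a literal "0".
        System.out.println(replace(TEN_GROUPS, "abcdefghij", "$1\\0"));
    }
}
```

The first call is the silent failure described above: no exception is
raised even though there is no group 10.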

My other concern is the syntax of backreferences in the replacement
string: $<name>.  Surveying the other major players (i.e.,
named-capture-enabled regex flavors associated with popular
programming languages), ${name} seems to be the most common
syntax--though there aren't a whole lot of data points yet, I admit.
Most significantly, .NET does it that way, and we're copying them on
the rest of the named-group syntax already, so why not on this?  Also,
I don't know of any other flavor that uses the $<name> syntax.

To summarize, I want to:

- change the replacement-string backreference syntax from $<name> to ${name}

- disallow group names starting with digits

- allow backreferences of the form \k<n> and ${n} where 'n' is one or
more digits, but interpret them as ordinal instead of named references
(and throw an exception if there's no such group).
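
As a rough check of the first two points, the sketch below probes how
a given java.util.regex build treats digit-led group names and
references to missing groups. It assumes the API as it later shipped
(Java 7 and newer), which is not necessarily the build discussed here.
The class and method names are hypothetical, and the \k<n> alias is
left out because its handling is not something this sketch can vouch
for.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of the JDK.
final class GroupRuleDemo {
    // True if the regex compiles, false if Pattern rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Returns the exception raised while applying the replacement, or null.
    static RuntimeException replacementFailure(String regex, String input,
                                               String replacement) {
        try {
            input.replaceAll(regex, replacement);
            return null;
        } catch (RuntimeException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("(?<first>a)"));   // letter-led name
        System.out.println(compiles("(?<1st>a)"));     // digit-led name
        System.out.println(replacementFailure("(a)", "a", "$2"));
        System.out.println(replacementFailure("(?<x>a)", "a", "${y}"));
    }
}
```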

Thoughts?
