# New Ticket Created by  "Carl Mäsak" 
# Please include the string:  [perl #72440]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org/rt3/Ticket/Display.html?id=72440 >


This be Rakudo a609d7 on Parrot r43600.

$ perl6 -e 'say "1ab2ab3c" ~~ /^ \d ** abc $/ ?? "OH NOES" !! "oh phew"'
OH NOES

This is a PGE bug. Here follows a brief explanation.

S05 states that unquoted literals like C<abc> are actually three
distinct atoms, each of which can be quantified separately. Thus,
C<abc*> means C<ab[c]*>, not C<[abc]*>. With that reasoning, C<\d **
abc> means C<\d ** [a] bc>.

However (though S05, to my knowledge, does not mention it), one might
perhaps temporarily lift the rule about each unquoted alphanumeric
character being its own atom in "** separator context". In that case,
C<\d ** abc> could be made to mean C<\d ** [abc]>. (I'm not saying
this exception would be a good idea, language-wise.)

In PGE, as we see above, C<\d ** abc> currently means C<\d ** [ab] c>.
This is due to an internal optimization that's usually invisible to
the user. When parsing C<abc>, PGE reads it as C<'ab' c>; more
generally, it reads every character of an unquoted literal except the
last as one string, and the last character as a separate atom. This
optimization makes a lot of sense if it turns out that C<c> has a
quantifier on it. If it doesn't, later steps in the regex compilation
merge C<ab> and C<c> back into one literal string.
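
As a rough illustration of the split-then-merge behaviour described
above, here is a minimal sketch in Python. It is not PGE's actual code;
the function names split_literal and merge_literals are hypothetical
and only model the two steps.

```python
def split_literal(chars):
    """Split an unquoted literal so that only its last character
    is left as a separate atom that may take a quantifier."""
    if len(chars) <= 1:
        return [chars]
    return [chars[:-1], chars[-1]]


def merge_literals(atoms, last_is_quantified):
    """Later pass: rejoin the pieces into one literal string when
    the last character turned out not to be quantified."""
    if last_is_quantified or len(atoms) < 2:
        return atoms
    return ["".join(atoms)]


atoms = split_literal("abc")
print(atoms)                                            # ['ab', 'c']
print(merge_literals(atoms, last_is_quantified=False))  # ['abc']
print(merge_literals(atoms, last_is_quantified=True))   # ['ab', 'c']
```

As long as nothing looks at the atoms between the two steps, the split
is invisible to the user.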

In the case of the separator in C<**>, this optimization produces the
wrong results. At the time C<ab> and C<c> would be merged, C<ab> has
already been bound as the separator of the C<**> operator.
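
To make the difference between the readings concrete, the sketch below
translates each of them by hand into an equivalent Python regex and
checks which ones accept a given string. The translations and the
labels "s05", "pge" and "whole" are my own assumptions for
illustration; nothing here is generated by PGE.

```python
import re

# Hand-translated equivalents of the three readings of \d ** abc.
READINGS = {
    "s05": r"\d(?:a\d)*bc",     # \d ** [a] bc
    "pge": r"\d(?:ab\d)*c",     # \d ** [ab] c
    "whole": r"\d(?:abc\d)*",   # \d ** [abc]
}


def matching_readings(text):
    """Return the labels of the readings that match the whole string."""
    return [name for name, pattern in READINGS.items()
            if re.fullmatch(pattern, text)]


print(matching_readings("1ab2ab3c"))   # ['pge']
print(matching_readings("1a2a3bc"))    # ['s05']
print(matching_readings("1abc2abc3"))  # ['whole']
```

Only the C<\d ** [ab] c> reading accepts the string from the one-liner
at the top, which is consistent with the "OH NOES" output.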

I probably wouldn't submit this as a rakudobug, were it not for the
fact that, according to my reading of
<http://github.com/perl6/nqp-rx/blob/eb9c75a9b6bf144808ca6d24f31b606e9e8adba8/src/Regex/P6Regex/Grammar.pm>
(lines 47 and 67), this problem persists in nqp-rx, and thus will
persist in the ng branch of Rakudo once it supports regex matching.

For what it's worth, I suggest that /\d ** abc/ actually be
interpreted as /\d ** [a] bc/, but that a (suppressible) warning be
emitted whenever an atom follows a quantifier separator with no
whitespace in between.

// Carl
