Mike Lambert:
(a bunch of stuff about regexes)

No offense intended, but I had trouble understanding that, and I helped
come up with the thing.  :^)  So, I'll try to interpret.

In Perl 5, we came up against the problem of simply running out of
characters in regexes.  To deal with this, Larry came up with the
(?_regex) syntax, where _ is some character.  Although a clever use of
an otherwise impossible sequence, it's also gratuitously ugly.

Consider the many roles (?_) plays:

        Non-capturing parentheses: (?:)
        Look(ahead|behind)s: (?=), (?!), (?<=), (?<!)
        Inline code: (?{}), (??{})
        Inline modifiers: (?imsx-imsx), (?imsx-imsx:)
        Conditionals: (?()), (?()|)
        Comments: (?#)
        Non-backtracking: (?>)

Obviously, this is getting out of hand--using more than one or two of
those constructs makes your regex much harder to read.
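
To make the readability problem concrete, here is a small sketch that is not from Mike's message.  It uses Python's re module only because it accepts the same Perl 5 style (?_...) constructs for modifiers, lookarounds, non-capturing groups, and comments (inline code has no equivalent there).  The pattern, the sample text, and the helper name find_words are all made up for illustration.

```python
import re

# Several Perl 5 style (?_...) constructs packed into one pattern:
#   (?i)          inline modifier: case-insensitive
#   (?<!\$)       negative lookbehind: not preceded by a dollar sign
#   (?:foo|bar)   non-capturing group
#   (?=\d)        positive lookahead: a digit must follow
#   (?#...)       inline comment
DENSE = r"(?i)(?<!\$)(?:foo|bar)(?=\d)(?#digit must follow)"


def find_words(text):
    """Return every match of DENSE in text (hypothetical helper)."""
    return re.findall(DENSE, text)


print(find_words("foo1 $bar2 BAR3 baz4 foo"))
```

Even with only five constructs, the pattern is mostly punctuation, which is the complaint above.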

Let's first tackle non-capturing parentheses and lookarounds.  If we
think about what metacharacters are around, we can realize that {} is
only legal with numbers inside it.  [0]  That means that we can probably
reuse it.  If we think about it, we can derive a few basic categories:

        -consuming (_) or not (|) [1]
                Reasoning: _ is fat, | is skinny
        -positive (=) or negative (!)
                Reasoning: same as in Perl 5
        -forwards (>) or backwards (<)
                Reasoning: same as in Perl 5

The characters in parentheses are prefix characters that indicate which
is to be used.  A simple mapping of the five things this section covers
follows:

        Perl 5          Perl 6
        ------          ------
        (?:regex)       {_=>regex}
        (?=regex)       {|=>regex}
        (?!regex)       {|!>regex}
        (?<=regex)      {|=<regex} [2]
        (?<!regex)      {|!<regex}

Obviously, that's a bit much to type.  But if we define some reasonable
defaults, it becomes more manageable.  By default, the specifier is _=>.
So here's a map of what you're more likely to see in a regex:

        Perl 5          Perl 6
        ------          ------
        (?:regex)       {regex}
        (?=regex)       {|regex}
        (?!regex)       {|!regex}
        (?<=regex)      {|<regex}
        (?<!regex)      {|!<regex}

However, the sharp reader might have noticed that there were three
possibilities missing from the above tables.  That's right--we get free
features too!

        {_!>regex}      --      Nonsensical.
        {_=<regex}      --      Match backwards. [3]
        {_!<regex}      --      Nonsensical.

Well, one free feature--we end up with reversed regexes from this deal.
The final table looks like this:

        Perl 5          Perl 6
        ------          ------
        (?:regex)       {regex}
        N/A             {<regex}
        (?=regex)       {|regex}
        (?!regex)       {|!regex}
        (?<=regex)      {|<regex}
        (?<!regex)      {|!<regex}
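
The tables above amount to a mechanical mapping, which can be sketched as a tiny translator.  This is only an illustration under stated assumptions: the function name to_perl5 and the prefix-parsing regex are hypothetical, it handles a single {...} group rather than a whole regex, and it inherits the ambiguity of any body that happens to begin with a specifier character.

```python
import re

# Hypothetical sketch: parse the optional three-part specifier
# (consuming/not, positive/negative, forwards/backwards) off one group.
_GROUP = re.compile(
    r"\{(?P<kind>[_|])?(?P<sign>[=!])?(?P<dir>[<>])?(?P<body>.*)\}",
    re.S,
)

# Fully specified Perl 6 prefix -> Perl 5 opening sequence.
_PERL5_OPEN = {
    ("_", "=", ">"): "(?:",
    ("|", "=", ">"): "(?=",
    ("|", "!", ">"): "(?!",
    ("|", "=", "<"): "(?<=",
    ("|", "!", "<"): "(?<!",
}


def to_perl5(group):
    """Translate one proposed {...} group into its Perl 5 spelling."""
    m = _GROUP.fullmatch(group)
    if m is None:
        raise ValueError("not a {...} group: %r" % group)
    # Missing parts fall back to the default specifier _=>.
    key = (m["kind"] or "_", m["sign"] or "=", m["dir"] or ">")
    opener = _PERL5_OPEN.get(key)
    if opener is None:
        # {_=<...} (match backwards) and the nonsensical forms land here.
        raise ValueError("no Perl 5 equivalent for %r" % "".join(key))
    return opener + m["body"] + ")"


print(to_perl5("{|!<foo}"))
```

The backwards match {<regex} deliberately raises an error in this sketch, since Perl 5 has nothing to translate it to.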

He then went on to describe something I didn't understand at all.
Sorry.

--- BEGIN MY THOUGHTS ---

The only major drawback I can see is that a naïve user might type
{<b>.*?</b>}+ expecting to match a run of text in bold tags and get a
lookbehind instead--so it may be wise to leave the | and _ specifiers
out of this altogether, and come up with a better way.  I'll address
that point shortly.

In the meantime, let's consider some of the other syntaxes.  The inline
code things are a good opportunity for improvement--and they have a good
alternative.  In Perl 5, ({ ought not to be legal, but it is--it's
hacked in to be the same as (\{.  So, we can drop a question mark from
each of the block forms, getting ({code}) and (?{code}).  However, we can
go even further by combining the two.

Here's how it works:
        -If the code returns undef, we backtrack.
        -If the code returns the empty string, we move on.
        -If the code returns anything else, we interpolate that into
         the regex.

So, we now just have ({}).
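
The three-way rule for a combined ({}) block can be written down as a small dispatch function.  This is a sketch of the rule only, not of any regex engine: the name handle_block_result and the action labels are hypothetical, and Python's None stands in for Perl's undef.

```python
def handle_block_result(result):
    """Decide what a regex engine would do with a ({code}) return value.

    Returns an (action, text) pair; the labels are made up for this sketch.
    """
    if result is None:
        # undef: the block failed, so the engine backtracks.
        return ("backtrack", None)
    if result == "":
        # Empty string: the block succeeded and matching moves on.
        return ("continue", None)
    # Anything else is interpolated into the regex at this point.
    return ("interpolate", str(result))


print(handle_block_result(r"\d+"))
```

Note that under this rule a block returning 0 interpolates the text "0"; only undef signals failure.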

Comments can go, since Larry has said that /x will be on by default
anyway.

That leaves conditionals, non-backtracking sections, inline modifiers,
and (maybe) non-capturing parens.  We now have three characters that
aren't valid in these places: *, +, and ?.

My suggestion is this:

        Thing               Syntax          Logic
        -----               ------          -----
        Conditionals        (?()|)          The question mark makes sense for a conditional.
        Inline modifiers    (?imsx-imsx)    Might as well be a little bit compatible.
        Non-backtracking    (+)             + requires more than * does.
        Non-capturing       (*)             Suggestions welcome.  :^)

So, my final suggestions are:

        Perl 5          Perl 6
        ------          ------
        (?:)            (*)
        (?=)            {}
        (?!)            {!}
        (?<=)           {<}
        (?<!)           {<!}    [4]
        (?())           (?())
        (?()|)          (?()|)
        (?imsx-imsx)    (?imsx-imsx)
        (?imsx-imsx:)   (?imsx-imsx:)
        (?>)            (+)
        (?{})           ({}) returning empty string
        (??{})          ({}) returning a string or regex
        (?#)            N/A--obsolete

Please feel free to comment on these.

[0] Perl won't be the first tool to take advantage of this--lex uses
something similar for named subexpressions.

[1] Neither of these characters is ideal, however.  | looks like !, and
_ might reasonably be at the beginning of this sort of thing anyway.
Better suggestions are welcome.

[2] Mike originally had all the backwards matches as sexegers.  I think
this is a bad idea, but feel obligated to mention that.

[3] This seems a bit useless to me too.  It's probably more useful to
have a /r modifier on the entire regex.

[4] I changed the ordering for this one to avoid an ambiguity.

--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

#define private public
    --Spotted in a C++ program just before a #include
