RE: Regex and Matched Delimiters

Brent Dax Mon, 22 Apr 2002 22:52:29 -0700

Larry Wall:
# Me writes:
# : > Very nice (but, I assume you meant {$foo data})!
# : 
# : I didn't mean that (even if I should have).
# : 
# : Aiui, Mike's final suggestion was that parens end up
# : doing all the (ops data) tricks, and braces are used
# : purely to do code insertions. (I really liked that idea.)
# : 
# : So:
# : 
# : Perl 5            Perl6
# : (data)            ( data)
# : (?opsdata)        (ops data)
# : ({})              {}  
# 
# Hmm.  Let me spill a few beans about where I'm going with A5. 
#  I've been thinking similar thoughts about the problem of 
# overloading parens so heavily in Perl 5, but I'm going in a 
# slightly different direction with it.  The basic principles 
# for the new regexen are:
# 
#     * Parens always capture.
#     * Braces are always closures.
#     * Square brackets are always character classes.
#     * Angle brackets are always metasyntax (along with backslash).
# 
# So a first whack at the differences might be:
# 
#     Old                       New
#     ---                       ---
#     //                        /<prior>/  ???
#     ?pat?                     /<?f:pat>/  ???
#     /pat/i            m:i/pat/ or /<?i:pat>/ or even m<?i:pat> ???


Whoa, those are moving to the front?!?

#     /pat/x            /pat/
#     /^pat$/m          /^^pat$$/

That's...odd.  Is $$ (the variable) going away?
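
For comparison, a minimal sketch of what /m does in Perl 5 today, and of
what $$ currently means.  The sample string and variable names are made
up for illustration; the ^^/$$ spelling in the comment is only the
proposal quoted above, not something any perl accepts yet.

```perl
use strict;
use warnings;

my $text = "foo\nbar\nbaz\n";

# Perl 5 without /m: ^ and $ anchor only at the ends of the whole string,
# so no single "word" spans the entire three-line string.
my @without_m = $text =~ /^(\w+)$/g;

# Perl 5 with /m: ^ and $ anchor at every line boundary.
# (The proposal would spell this /^^(\w+)$$/ with no modifier.)
my @with_m = $text =~ /^(\w+)$/mg;

# $$ is currently the process ID, hence the question about the variable.
my $pid = $$;

print "without /m: ", scalar(@without_m), " matches\n";
print "with /m:    @with_m\n";
print "pid:        $pid\n";
```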

#     /./s                      /<any>/ or /<.>/ ???

I think that . is too common a metacharacter to be relegated to this.

#     \p{prop}          <+prop>  ???
#     \P{prop}          <-prop>  ???

Intriguing.

#     space             <sp> (or \h for "horizontal"?)

Same thinking as '.'.

#     {n,m}             <n,m>

Ah, OK.

#     \t                        also <tab>
#     \n                        also <lf> or <nl> (latter matching logical newline)
#     \r                        also <cr>
#     \f                        also <ff>
#     \a                        also <bell>
#     \e                        also <esc>

I can tell you right now that these are going to screw people up.
They'll try to use these in normal strings and be confused when it
doesn't work.  And you probably won't be able to emit a warning,
considering how much CGI Perl munches.

#     \033                      same
#     \x1B                      same
#     \x{263a}          \x<263a> ???

Why?  Wouldn't we want the same thing to work in quoted strings?  (Or
are those changing syntaxes too?)

#     \c[                       same
#     \N{name}          <name>
#     \l                        same
#     \u                        same
#     \Lstring\E                \L<string>
#     \Ustring\E                \U<string>

So that's changed from whenever you talked about \q{} ?

#     \E                        gone
#     [\040\t]          \h        plus any Unicode horizontal whitespace
#     [\r\n\ck]         \v      plus any Unicode vertical whitespace
# 
#     \b                        same
#     \B                        same

#     \A                        ^
#     \Z                        same?
#     \z                        $

Are you sure that optimizes for the common case?
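
To make the concern concrete, here is a small sketch of how the three
end anchors differ in Perl 5 on a line that still has its newline.  The
variable names are illustrative only.  If $ takes over the meaning of
\z, as the table seems to say, the first test below would presumably
start failing on unchomped input.

```perl
use strict;
use warnings;

my $line = "pat\n";    # a typical unchomped line

# Perl 5: $ matches at the very end OR just before a final newline.
my $dollar  = $line =~ /pat$/  ? 1 : 0;

# Perl 5: \Z behaves like $ without /m.
my $big_z   = $line =~ /pat\Z/ ? 1 : 0;

# Perl 5: \z matches only at the very end of the string.
my $small_z = $line =~ /pat\z/ ? 1 : 0;

print "\$ : $dollar\n";
print "\\Z: $big_z\n";
print "\\z: $small_z\n";
```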

#     \G                        <pos>, but assumed in nested patterns?
#  
#     \1                        $1
# 
#     \Q$var\E          $var    always assumed literal, so $1 is literal backref

So these are reinterpolated every time you backtrack?  Are you *trying*
to destroy regex performance?  :^)

#     $var                      <$var>  assumed to be regex

What if $var is a qr//ed object?
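
For reference, a minimal sketch of the Perl 5 behaviour these two rows
would swap.  Today a bare $var (string or qr// object) is interpolated
as a pattern, and \Q...\E is what makes it literal.  The variable names
and sample strings are illustrative only.

```perl
use strict;
use warnings;

my $re  = qr/a.c/;    # compiled pattern: "." is a metacharacter
my $str = 'a.c';      # plain string holding the same characters

# Perl 5: interpolating a qr// object keeps its regex meaning.
my $qr_matches_abc      = "abc" =~ /^$re$/        ? 1 : 0;

# Perl 5: \Q...\E quotes metacharacters, so "." must match a literal dot.
my $quoted_matches_abc  = "abc" =~ /^\Q$str\E$/   ? 1 : 0;
my $quoted_matches_self = "a.c" =~ /^\Q$str\E$/   ? 1 : 0;

print "qr object vs 'abc':  $qr_matches_abc\n";
print "quoted str vs 'abc': $quoted_matches_abc\n";
print "quoted str vs 'a.c': $quoted_matches_self\n";
```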

#     =~ $re            =~ /<$re>/   ouch?

I don't see the win.

#     (??{$rule})               <rule>
#     (?{ code })               { code } with failure semantics
#     (?#...)           {"..."}         :-)
#     (?:...)           <:...>
#     (?=...)           <before: ...>
#     (?!...)           <!before: ...>
#     (?<=...)          <after: ...>
#     (?<!...)          <!after: ...>

Cute.  (Wait a minute, aren't those reversed?)
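
A quick sanity check of the Perl 5 side of that table, as a sketch with
made-up sample strings.  (?=...) is the lookahead, which asserts that
the current position sits *before* the subpattern, and (?<=...) is the
lookbehind, which asserts that it sits *after* it.

```perl
use strict;
use warnings;

my $s = "price: 100USD";

# Lookahead: digits that stand before "USD" ("USD" itself is not consumed).
my ($ahead)  = $s =~ /(\d+)(?=USD)/;

# Lookbehind: digits that stand after "price: ".
my ($behind) = $s =~ /(?<=price: )(\d+)/;

# Negative lookahead: a full run of digits NOT followed by "EUR".
my ($neg)    = "100EUR 200USD" =~ /(\d+)(?!\d|EUR)/;

print "lookahead:          $ahead\n";
print "lookbehind:         $behind\n";
print "negative lookahead: $neg\n";
```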

#     (?>...)           <grab: ...>
#     (?(cond)t|f)      Not sure.  Could just use { if ... }

<if(cond):true|false>?

# Obviously the <word> and <word:...> syntaxes will be user 
# extensible. We have to be able to support full grammars.  I 
# consider it a feature that <foo> looks like a non-terminal in 
# standard BNF notation.  I do not consider it a misfeature 
# that <foo> resembles an HTML or XML tag, since most of those 
# languages need to be matched with a fancy rule named <tag> anyway.

But that *does* make it harder to define the fancy rules.  I could see
someone defining rules like:

        'lt' => qr/\</,
        'gt' => qr/\>/

just to get around backslashing everything in sight.

# An interesting idea would be that if you say
# 
#     m<foo: pat>
# 
# or
# 
#     m{code}
# 
# it's as if you said
# 
#     m/<foo: pat>/
#     
# or
#     
#     m/{code}/

I don't know about that one.  I often use {} as delimiters on regexen
because it's a character that doesn't occur in data very often.  I think
the gain of two characters isn't as critical as the loss of options.
 
Understand, I'm not a regex Luddite.  I've been working with yacc and
lex a lot lately, so I have at least a hint of how powerful formal
parsing is--and I love all of these features.  However, I think that
syntactically a lot of this is a loss for the average Perl hacker.  (Not
me, not you, and not most of the people on this list--the *average*
hacker, like the 3s or 4s on PerlMonks.)

The *average* Perl hacker doesn't have much use for embedded code in a
regex or BNF-like rules.  The *average* Perl hacker just wants to do an
s#<emphasis>(\d{1,3}(\.\d{1,3}){3})</emphasis>#<inet>$1</inet># (an
early example from "Mastering Regular Expressions").  There's a very
good chance that he knows exactly what the input data looks like and
that this will work on it.
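
As a sketch, here is that substitution run under Perl 5 on some
hypothetical input (the sample text and address are invented).  Note how
many literal angle brackets it contains; if those become metasyntax,
each one would presumably need a backslash.

```perl
use strict;
use warnings;

# Hypothetical input; the markup and address are made up for illustration.
my $doc = "Host: <emphasis>192.168.0.1</emphasis>, see <emphasis>notes</emphasis>.";

# Copy $doc, then rewrite the first <emphasis> element that wraps
# something shaped like a dotted-quad address.
(my $out = $doc) =~ s#<emphasis>(\d{1,3}(\.\d{1,3}){3})</emphasis>#<inet>$1</inet>#;

print "$out\n";
```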

For this simple reason, I strongly suggest somehow hijacking curlies
for the metasyntax instead of angle brackets, and perhaps making
embedded code use two curlies.  After all, regexes are intimidating
enough already.  :^)

--Brent Dax <[EMAIL PROTECTED]>
@roles=map {"Parrot $_"} qw(embedding regexen Configure)

#define private public
    --Spotted in a C++ program just before a #include
