Re: Regex and Matched Delimiters

Larry Wall Tue, 23 Apr 2002 09:32:24 -0700

Brent Dax writes:
: #     ?pat?                   /<?f:pat/  ???
: #     /pat/i          m:i/pat/ or /<?i:pat>/ or even m<?i:pat> ???
: 
: Whoa, those are moving to the front?!?


The problem with options in general is that they can't easily modify
parsing if they come in back.  Now in the particular case of /f and /i,
it probably doesn't matter.  But I was trying to see if there was some way
to do away with trailing options altogether.  This might even extend to
things like:

    qq:s"$interpolates @doesn't %doesn't"

And that's definitely a situation where it changes the parse.  Hmm, if
strings have options, they're probably additive, so to add scalar
interpolation you'd want to base it on "q", not "qq":

    q:s"$interpolates @doesn't %doesn't"

On the other hand, that doesn't work for the other things like "qr", so
maybe any of :s, :a, :h turn off default interpolations, so qr:a would
only interpolate arrays, for instance.
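
For comparison, here is a minimal Perl 5 sketch of the all-or-nothing
interpolation that the proposed :s, :a, :h options would refine.  The
variable names and data are made up for illustration, and the last line
only emulates what a hypothetical q:s might yield by escaping the sigil
by hand; it is not the proposed syntax itself.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my $name = 'world';
my @list = (1, 2, 3);

# Perl 5 qq// interpolates both scalars and arrays.
my $double = qq{$name @list};

# Perl 5 q// interpolates nothing.
my $single = q{$name @list};

# Scalar-only interpolation has to be emulated by escaping the array sigil.
my $scalar_only = qq{$name \@list};

print "$double\n";        # world 1 2 3
print "$single\n";        # $name @list
print "$scalar_only\n";   # world @list
```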

: #     /pat/x          /pat/
: #     /^pat$/m                /^^pat$$/
: 
: That's...odd.  Is $$ (the variable) going away?

Maybe.  It'd be $*PID if so, since it's truly global to the process.
But if not, we could special case $$ inside regexes, just as we already
special case $ itself.
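
As a reminder of what is being replaced, this is a small Perl 5 sketch
of /m anchoring and of $$ as the process ID.  The sample text is an
assumption for illustration; the ^^ and $$ regex forms from the table
above are only proposals and are not used here.

```perl
use strict;
use warnings;

# Hypothetical two-line input.
my $text = "first\nsecond\n";

# Without /m, ^ matches only at the start of the string and $ only at
# the end (or before a final newline), so no whole line matches.
my @without_m = $text =~ /^(\w+)$/g;

# With /m, ^ and $ match at every line boundary.
my @with_m = $text =~ /^(\w+)$/mg;

# Outside a regex, $$ is the process ID in Perl 5.
my $pid = $$;

print scalar(@without_m), " ", scalar(@with_m), "\n";   # 0 2
```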

: #     \p{prop}                <+prop>  ???
: #     \P{prop}                <-prop>  ???
: 
: Intriguing.

Yeah, especially when you start stacking them.  But maybe we're treading
on [...] territory.  It could be argued that <...> is just a generalized
form of POSIX's [:...:] construct.

: #     \t                      also <tab>
: #     \n                      also <lf> or <nl> (latter matching logical newline)
: #     \r                      also <cr>
: #     \f                      also <ff>
: #     \a                      also <bell>
: #     \e                      also <esc>
: 
: I can tell you right now that these are going to screw people up.
: They'll try to use these in normal strings and be confused when it
: doesn't work.  And you probably won't be able to emit a warning,
: considering how much CGI Perl munches.

I can see pragmatic variants in which those *do* interpolate by default.
And pragmatic variants where they don't.

: #     \033                    same
: #     \x1B                    same
: #     \x{263a}                \x<263a> ???
: 
: Why?  Wouldn't we want the same thing to work in quoted strings?  (Or
: are those changing syntaxes too?)

I'm just wondering how far I can drive the principle that {} is always
a closure (even though it isn't).  I admit that it's probably overkill
here, which is why there are question marks.

: #     \c[                     same
: #     \N{name}                <name>
: #     \l                      same
: #     \u                      same
: #     \Lstring\E              \L<string>
: #     \Ustring\E              \U<string>
: 
: So that's changed from whenever you talked about \q{} ?

Possibly.  Again, the question is whether {} more strongly imply
something that's not true.  But curlies were so overloaded in Perl 5
that I don't think people are going to necessarily expect them to do
only one thing.  Still, if <> are taking over the role of "unmarked
metasyntactic delimiters", maybe they belong here too.

: #     \E                      gone
: #     [\040\t]                \h        plus any Unicode horizontal whitespace
: #     [\r\n\ck]               \v      plus any Unicode vertical whitespace
: #
: #     \b                      same
: #     \B                      same
: 
: #     \A                      ^
: #     \Z                      same?
: #     \z                      $
: 
: Are you sure that optimizes for the common case?

No, I'm not sure, but we have to clean up the \A...\z mess somehow.

: #     \G                      <pos>, but assumed in nested patterns?
: #
: #     \1                      $1
: #
: #     \Q$var\E                $var    always assumed literal, so $1 is literal backref
: 
: So these are reinterpolated every time you backtrack?  Are you *trying*
: to destroy regex performance?  :^)

They're not interpolated.  They're matched, as in string comparison, just
as backrefs are matched right now.
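
The distinction between matching a variable's contents literally and
compiling them as a pattern already exists in Perl 5, where \Q...\E
selects the literal behaviour.  A minimal sketch, with a made-up value
for $var:

```perl
use strict;
use warnings;

# Hypothetical value containing a regex metacharacter.
my $var = 'a.c';

# Perl 5 default: the variable is interpolated and compiled as a
# pattern, so '.' matches any character.
my $as_pattern = ('abc' =~ /\A$var\z/) ? 1 : 0;

# With \Q...\E the contents are matched as a literal string.
my $literal_abc = ('abc' =~ /\A\Q$var\E\z/) ? 1 : 0;
my $literal_dot = ('a.c' =~ /\A\Q$var\E\z/) ? 1 : 0;

print "$as_pattern $literal_abc $literal_dot\n";   # 1 0 1
```

Under the proposal the bare $var would behave like the \Q...\E lines.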

: #     $var                    <$var>  assumed to be regex
: 
: What if $var is a qr//ed object?

Then it's a pretty easy assumption that it's a regex.  :-)

: #     =~ $re          =~ /<$re>/   ouch?
: 
: I don't see the win.

No difference if $re is qr//, but if it's not, that is the syntax for
forcing $re to be interpreted as a regex.
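
For contrast, in Perl 5 a plain string on the right of =~ is already
treated as a regex at run time, which is the behaviour the /<$re>/ form
would make explicit.  A short sketch, with hypothetical variable names:

```perl
use strict;
use warnings;

my $str_re = 'a+b';     # plain string that happens to look like a pattern
my $obj_re = qr/a+b/;   # compiled regex object

# Both forms are applied as the pattern a+b in Perl 5.
my $str_hit = ('aaab' =~ $str_re) ? 1 : 0;
my $obj_hit = ('aaab' =~ $obj_re) ? 1 : 0;

# The literal text 'a+b' does not match the pattern a+b.
my $lit_hit = ('a+b' =~ $str_re) ? 1 : 0;

print "$str_hit $obj_hit $lit_hit\n";   # 1 1 0
```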

: #     (??{$rule})             <rule>
: #     (?{ code })             { code } with failure semantics
: #     (?#...)         {"..."}         :-)
: #     (?:...)         <:...>
: #     (?=...)         <before: ...>
: #     (?!...)         <!before: ...>
: #     (?<=...)        <after: ...>
: #     (?<!...)                <!after: ...>
: 
: Cute.  (Wait a minute, aren't those reversed?)

Nope, I realized they were ambiguous depending on whether you think of
them as declarative or operational, but I settled on the declarative
reading because it works with their being assertions.  All the other
options I could think of are either really clunky or similarly ambiguous.
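
The declarative reading can be checked against the existing Perl 5
assertions: (?=...) asserts that the current position is before the
subpattern, and (?<=...) that it is after it.  A minimal sketch with a
made-up input string:

```perl
use strict;
use warnings;

# Hypothetical input.
my $s = 'foobar';

# (?=bar): the match position is *before* "bar".
my ($before) = $s =~ /(\w+?)(?=bar)/;

# (?<=foo): the match position is *after* "foo".
my ($after) = $s =~ /(?<=foo)(\w+)/;

# The negated forms assert the opposite.
my $not_before = ('foobaz' =~ /foo(?!bar)/)  ? 1 : 0;
my $not_after  = ('xxbar'  =~ /(?<!foo)bar/) ? 1 : 0;

print "$before $after $not_before $not_after\n";   # foo bar 1 1
```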

: #     (?>...)         <grab: ...>
: #     (?(cond)t|f)    Not sure.  Could just use { if ... }
: 
: <if(cond):true|false>?

Well, sure, if you're attached to that particular set of punctuation.
But we could also have

    <if cond:    ...>
    <elsif cond: ...>
    <else:       ...>

On the other hand, I think we'll often see parsers doing things like:

    $TERM = qr/{
        when cond { /.../ }
        when cond { /.../ }
        when cond { /.../ }
        when cond { /.../ }
        when cond { /.../ }
        when cond { /.../ }
        default   { /.../ }
    }/;

So maybe the <> version is:

    <when cond: ...>
    <when cond: ...>
    <when cond: ...>
    <when cond: ...>
    <when cond: ...>
    <default:   ...>

(assuming the scoping of "break" can be worked out).

: # Obviously the <word> and <word:...> syntaxes will be user
: # extensible. We have to be able to support full grammars.  I
: # consider it a feature that <foo> looks like a non-terminal in
: # standard BNF notation.  I do not consider it a misfeature
: # that <foo> resembles an HTML or XML tag, since most of those
: # languages need to be matched with a fancy rule named <tag> anyway.
: 
: But that *does* make it harder to define the fancy rules.  I could see
: someone defining rules like:
: 
:       'gt' => qr/\</,
:       'lt' => qr/\>/
: 
: just to get around backslashing everything in sight.

I could see someone saying qr:X or some such.

: # An interesting idea would be that if you say
: #
: #     m<foo: pat>
: #
: # or
: #
: #     m{code}
: #
: # it's as if you said
: #
: #     m/<foo: pat>/
: #
: # or
: #
: #     m/{code}/
: 
: I don't know about that one.  I often use {} as delimiters on regexen
: because it's a character that doesn't occur in data very often.  I think
: the gain of two characters isn't as critical as the loss of options.
:
: Understand, I'm not a regex Luddite.  I've been working with yacc and
: lex a lot lately, so I have at least a hint of how powerful formal
: parsing is--and I love all of these features.  However, I think that
: syntactically a lot of this is a loss for the average Perl hacker.  (Not
: me, not you, and not most of the people on this list--the *average*
: hacker, like the 3s or 4s on PerlMonks.)
: 
: The *average* Perl hacker doesn't have much use for embedded code in a
: regex or BNF-like rules.  The *average* Perl hacker just wants to do an
: s#<emphasis>(\d{1,3}(\.\d{1,3}){3})</emphasis>#<inet>$1</inet># (an
: early example from "Mastering Regular Expressions").  There's a very
: good chance that he knows exactly what the input data looks like and
: that this will work on it.
: 
: For this simple reason, I highly suggest somehow hijacking curlies
: instead, and perhaps making embedded code use two curlies.  After all,
: regexes are intimidating enough already.  :^)

With respect to Perl 5, I'm trying to unhijack curlies as much as possible.

Larry
