Regex and Matched Delimiters

Mike Lambert Fri, 19 Apr 2002 18:24:11 -0700

Since A5 is the next thing up, it's probably worthwhile to try and get any
changes we want to regexes before they get finalized. :)


The (?something .. ) syntax, which was used to extend the lifetime of
parentheses in Perl5, was due to the lack of matched delimiters.

In a short discussion between DrForr, brentdax, and me, we
came up with using {} as another set of delimiters. One proposal had their
use as being unambiguous, as long as you didn't have a number as the first
thing after the opening {. {\d would be the old use, and {\D would be the
new use.
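
The disambiguation rule can be sketched in a few lines of Perl. This is
only an illustration of the rule as stated above; brace_kind is a
hypothetical helper, not part of any proposal.

```perl
use strict;
use warnings;

# Hypothetical helper: classify an opening brace by the character after it.
# A digit means the old quantifier use ({\d); anything else means the
# proposed grouping use ({\D).
sub brace_kind {
    my ($fragment) = @_;
    return $fragment =~ /^\{\d/ ? 'quantifier' : 'group';
}

print brace_kind('{2,5}'),   "\n";   # quantifier
print brace_kind('{a|b|c}'), "\n";   # group
```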

I didn't like how everything got squeezed into Perl5's (? .. ) syntax.
They were:
(?#..)
(?imsx-imsx:..)
(?:..)
(?=..)
(?!..)
(?<=..)
(?<!..)
(?{..})
(??{..})
(?>..)
(?(..)..|..)

From what I understand (?( and (?{ can both be combined into a smarter use
of (??{ .
(?($a)$b|$c) => (??{ $a ? $b : $c }) #using perl5 ?: operator here
(?{$a}) => (??{ $a; qr// })
(??{$a}) => (??{ $a })
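
As a rough Perl 5 check of the first mapping, the sketch below writes the
same choice once with (?(..)..|..) and once with (??{ .. }). The variable
$want_x and the patterns are made up for the illustration.

```perl
use strict;
use warnings;

# $want_x is a made-up condition, used only for this illustration.
our $want_x = 1;

# Perl 5 conditional form: (?(condition)yes|no), with a code condition.
my $conditional = qr/^ab(?(?{ $want_x })X|Y)\z/;

# The same choice expressed through (??{ ... }), as suggested above.
my $postponed = qr/^ab(??{ $want_x ? 'X' : 'Y' })\z/;

for my $str ('abX', 'abY') {
    printf "%s: conditional=%d postponed=%d\n",
        $str,
        ($str =~ $conditional ? 1 : 0),
        ($str =~ $postponed   ? 1 : 0);
}
```

Both patterns consult $want_x when the match runs, so flipping it changes
which branch each one accepts.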

I have a feeling Larry has already weighed in on that particular issue,
but I don't remember what he said.


The various zero-width (positive|negative) look-(behind|ahead) assertions
could be combined into some sort of orthogonal system.

For example, assuming we have {} available, we can define the operator
roughly like: {(\S+) (.*?)} (assuming that .*? handles nestedness
properly)

The values of \S+ would be split into characters, and each character would
implement some directive. So for example:

= try to match what follows -- positive assertion. the default.
! match the regex that follows and fail -- negative assertion

> match forwards. the default.
< match backwards. This would be followed by a sexeger. Limiting it to be
a lookbehind would necessarily restrict what follows to be a constant
string, and not a regex. If you want to implement

| zero-width match. | is skinny, and doesn't take up any room
_ regular matching. the default. _ is wide and does take up room.
(this assumes you program perl in non-monospaced fonts ;)

And luckily, these don't destroy the non-termination quality of regexes,
either.

So by default, all regexes are surrounded by {_=> } to indicate they try
to match what follows, consume text, and are heading forwards.
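
A minimal sketch of how the directive characters might be decoded, assuming
the fully specified form (flags, whitespace, then the body). The option
names (polarity, direction, width) and parse_group are invented here for
illustration; nothing in the proposal fixes them.

```perl
use strict;
use warnings;

# Hypothetical table: each directive character sets one option.
my %DIRECTIVE = (
    '=' => [ polarity  => 'positive'  ],
    '!' => [ polarity  => 'negative'  ],
    '>' => [ direction => 'forward'   ],
    '<' => [ direction => 'backward'  ],
    '|' => [ width     => 'zero'      ],
    '_' => [ width     => 'consuming' ],
);

# Parse "{FLAGS BODY}" into a hash of options plus the body text.
# Assumes the fully specified form: flags, whitespace, then the body.
sub parse_group {
    my ($group) = @_;
    my ($flags, $body) = $group =~ /^\{([=!<>|_]+)\s+(.*)\}\z/s
        or die "not a flagged brace group: $group\n";

    # Defaults correspond to {_=> ...}.
    my %opt = (polarity => 'positive', direction => 'forward', width => 'consuming');
    for my $char (split //, $flags) {
        my ($key, $value) = @{ $DIRECTIVE{$char} };
        $opt{$key} = $value;
    }
    $opt{body} = $body;
    return \%opt;
}

my $g = parse_group('{|! regexes}');
print join(', ', map { "$_=$g->{$_}" } sort keys %$g), "\n";
# body=regexes, direction=forward, polarity=negative, width=zero
```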

Here's an example of a convoluted use of these:
"regexes" =~ /({> regexes{< sexeger})regexes}{< sexeger}
{|= sexeger}{!| regexes}/

succeeds, with $& eq ''.

The order in which characters are matched is:
regexessexegerregexessexeger(negative zero-width attempt on sexeger, which
fails, so match continues)(positive zero-width attempt on regexes, which
succeeds, so match succeeds). All of this happens at the start of the string.

$1 and $2 and so on would reflect the characters being matched. So $1 in
the above example would be regexes. This destroys the COW optimizations we
can do on $1, $`, $', and so on to avoid making large copies of strings
everywhere. However, we only need to kill this optimization *if* the regex
contains {< ..}. If it doesn't we can just use indices into the string to
keep the overhead low.
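
For comparison, Perl 5 already exposes capture offsets through @- and @+,
which is roughly the "indices into the string" idea. The sketch below only
shows those offsets; capture_span is a hypothetical helper and says nothing
about how Parrot would store captures.

```perl
use strict;
use warnings;

# Hypothetical helper: return the start and end offsets of capture group 1,
# or an empty list if the pattern does not match.
sub capture_span {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    # @- and @+ hold the start and end offsets of each capture group.
    return ($-[1], $+[1]);
}

my $str = 'use regexes wisely';
my ($start, $end) = capture_span($str, qr/(regex)es/);
my $copy = substr($str, $start, $end - $start);
print "$start..$end -> $copy\n";   # 4..9 -> regex
```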

brentdax asked me for a use case for switching direction with {< blah} in
the middle of a regex, but I was unable to come up with one. We both see
the need for regexes that match backwards as a whole, but he argues that
switching in the middle of a regex match doesn't have a purpose. Can anyone
come up with a useful
regexesexeger? :) I still think we should be able to change direction
mid-regex, just because it's cooler, and fits in well with the system I've
described. :)

Now, for the mapping of the (?..'s into the { syntax. I'm going to give
the full-out form, although they can often be simplified if you know
something about the context in which you are placing the item (which
you often do). The third column represents what the thing would look like
in the context of a regular unmodified regex.

Perl5   Generic-CompletelySpecified    Perl6-inRegularRegex
(?:     {_=>                           {
(?=     {|=>                           {|
(?!     {|!>                           {|!
(?<=..) {|<= \Q({reverse '..'})\E}     {|< \Q({reverse '..'})\E}
(?<!..) {|<! \Q({reverse '..'})\E}     {|<! \Q({reverse '..'})\E}
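
The last two rows rely on reversing a constant string and matching it
backwards. A minimal Perl 5 sketch of that idea, assuming a literal
lookbehind and a known offset; behind_matches is a hypothetical helper:

```perl
use strict;
use warnings;

# Hypothetical helper: does $literal end exactly at offset $pos of $text?
# It reverses both the text before $pos and the literal, then matches forwards.
sub behind_matches {
    my ($text, $pos, $literal) = @_;
    my $before = reverse substr($text, 0, $pos);
    my $needle = reverse $literal;
    return $before =~ /^\Q$needle\E/ ? 1 : 0;
}

my $text = 'foobar';
print behind_matches($text, 3, 'foo'), "\n";        # 1
print( ($text =~ /(?<=foo)bar/ ? 1 : 0), "\n" );    # 1, the Perl 5 lookbehind
print behind_matches($text, 3, 'bar'), "\n";        # 0
```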

Pros:
-  Non-capturing grouping is now much easier: {a|b|c}+
-  User-extensible to support additional meta characters. Want to add
something which does god-knows what? Register a character along with
compilation logic for the contents. None of the above requires changes
to the current Parrot regex engine, and they were likely implemented in Perl5
only because they didn't require drastic changes to the nature of the
engine there either. It's mostly just a compilation detail.

Cons:
-  ! and | are way too similar. Better symbol for | is welcome.
-  User has to remember exactly how to do lookaheads using the operators.
Of course, they needed to remember what the syntax was before, so it's not
like the current version will be much worse.
-  Need to manually reverse $text, and it looks much worse. Perhaps
this particular "forwards-reading text, but match before us, starting
length '..' before" idiom can be Huffman-encoded somehow?

And for example, one can register 'e' as eval, and use the misx characters
to turn on/off those symbols for the thing enclosed in braces. The problem
with this latter idea is that the proposed system does not support turning
off of these pragmas. Perhaps we could make '-' a special character which
acts like it did in (?misx-misx:  ?

One could register 'e'=>eval and 'r'=>regex-interpolate to handle:
{e print "hello"} and {r $code = calculation($&);qr/$code/}
(Assuming you want to keep these separate, and don't want to force 'e' to
become {r print "hello";qr//}, where qr// represents an empty regex that
doesn't insert anything into the regex stream.)
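
A rough sketch of what such a registry might look like, translating the
brace form back into Perl 5 syntax. The names register_directive and
translate_group are invented for the illustration, and the output is just
a string, not a compiled regex.

```perl
use strict;
use warnings;

# Hypothetical registry: directive character => code that compiles the body.
my %HANDLER;

sub register_directive {
    my ($char, $compile) = @_;
    $HANDLER{$char} = $compile;
}

# 'e' runs code for its side effects; 'r' interpolates the result as a regex.
register_directive(e => sub { my ($body) = @_; return "(?{ $body })"  });
register_directive(r => sub { my ($body) = @_; return "(??{ $body })" });

# Translate "{X body}" into Perl 5 syntax using the registered handler.
sub translate_group {
    my ($group) = @_;
    my ($char, $body) = $group =~ /^\{(\S)\s+(.*)\}\z/s
        or die "not a brace group: $group\n";
    my $compile = $HANDLER{$char}
        or die "no handler registered for '$char'\n";
    return $compile->($body);
}

print translate_group('{e print "hello"}'), "\n";   # (?{ print "hello" })
print translate_group('{r $code = calculation($&);qr/$code/}'), "\n";
```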

I think I'll stop here, as I think I've said more than enough.

Mike Lambert
