> >...My point is that I think we're approaching this
> >the wrong way.  We're trying to apply more and more parser power into what
> >classically has been the lexer / tokenizer, namely our beloved
> >regular-expression engine.

I've been thinking the same thing.  It seems to me that the attempts to
shoehorn parsers into regex syntax have either been unsuccessful
(yielding an underpowered extension) or illegible or both.

An approach that appears to have been more successful is to find ways
to integrate regexes *into* parser code more effectively.  Damian
Conway's Parse::RecDescent module does this, and so does SNOBOL.

In SNOBOL, if you want to write a pattern that matches balanced
parentheses, it's easy and straightforward and legible:

        parenstring = '(' *parenstring ')'      
                    | *parenstring *parenstring        
                    | span('()')


(span('()') is like [^()]* in Perl.)

The solution in Parse::RecDescent is similar.

Compare this with the solutions that work now:

     # man page solution
     $re = qr{
              \(
                (?:
                   (?> [^()]+ )    # Non-parens without backtracking
                 |
                   (??{ $re })     # Group with matching parens
                 )*
              \)
            }x;

This is not exactly the same, but I tried a direct translation:

     $re = qr{ \( (??{$re}) \)
             | (??{$re}) (??{$re})
             | (?> [^()]+)
             }x;

and it looks worse and dumps core.  

This works:

        qr{
          ^
          (?{ local $d=0 })
          (?:                   
              \(                
              (?{$d++})         
           |  
              \)
              (?{$d--})
              (?        
                (?{$d<0})
                (?!)     
              )  
           |      
              (?> [^()]* )
                          
          )* 

                                
          (?                    
            (?{$d!=0})          
            (?!)                
          )
         $
        }x;

but it's rather difficult to take seriously.
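
For contrast, here is a rough sketch of the same whole-string balance
check done the other way around: small regexes called from ordinary
recursive Perl code, using \G and /gc to keep track of the match
position.  This is not the Parse::RecDescent or SNOBOL solution, just a
hand-written illustration of the idea; the names is_balanced and _items
are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $str is made up entirely of balanced
# parentheses and non-paren text (same job as the ^...$ regex above).
sub is_balanced {
    my ($str) = @_;
    pos($str) = 0;                      # start scanning at the beginning
    return _items(\$str) && pos($str) == length($str);
}

# items := ( '(' items ')' | run-of-non-parens )*
sub _items {
    my ($ref) = @_;
    while (1) {
        if ($$ref =~ /\G\(/gc) {        # open paren: recurse for the contents
            return 0 unless _items($ref);
            return 0 unless $$ref =~ /\G\)/gc;   # must be closed
        }
        elsif ($$ref =~ /\G[^()]+/gc) {
            # consumed a run of ordinary characters; keep going
        }
        else {
            return 1;                   # nothing more to consume at this level
        }
    }
}
```

Each regex here is trivial; the nesting logic lives in the Perl code,
where it can be read and debugged like any other code.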

The solution proposed in the recent RFC 145:

        /([^\m]*)(\m)(.*?)(\M)([^\m\M]*)/g

is not a lot better.  David Corbin's alternative looks about the same.

On a different topic from the same barrel, we just got a proposal that
[(23,39)] should match only numbers between 23 and 39.  It seems to me
that rather than trying to shoehorn one special-purpose syntax after
another into the regex language, which is already overloaded, it
would be better to integrate regex matching more closely with Perl
itself.  Then you could use regular Perl code to control things like
numeric ranges.

Note that at present, you can get the effect of [(23,39)] by writing
this:

                (\d+)(?(?{$1 < 23 || $1 > 39})(?!))

which isn't pleasant to look at, but I think it points in the right
direction, because it is a lot more flexible than [(23,39)].  If you
need to fix it to match 23.2 but not 39.5, it is straightforward to do
that:  
  
        (\d+(\.\d*)?)(?(?{$1 < 23 || $1 > 39})(?!))

The [(23,39)] notation, however, is doomed.  When you need a variation
on it, all you can do is propose Yet Another Extension for Perl 7.

The big problem with 

                (\d+)(?(?{$1 < 23 || $1 > 39})(?!))

is that it is hard to read and understand.
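
For comparison, here is a minimal sketch of the same range check with
the numeric test moved out of the pattern and into plain Perl.  The
helper name numbers_in_range is invented for the example, and this is
only what can be done today by matching first and filtering afterwards,
not the tighter integration argued for here.

```perl
use strict;
use warnings;

# Hypothetical helper: return every number in $text that lies
# between $lo and $hi inclusive.
sub numbers_in_range {
    my ($text, $lo, $hi) = @_;
    # The regex only finds things shaped like numbers...
    my @nums = $text =~ /(\d+(?:\.\d*)?)/g;
    # ...and ordinary Perl decides which ones are acceptable.
    return grep { $_ >= $lo && $_ <= $hi } @nums;
}

my @hits = numbers_in_range("12 23.2 30 39.5 40", 23, 39);
print "@hits\n";
```

One difference to keep in mind: the filter runs after matching is
finished, so it cannot make the regex engine backtrack and try a
shorter number the way the (?!) version can.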

The real problem here is that regexes are single strings.  When you
try to compress a programming language into a single string this way,
you end up with something that looks like Befunge or TECO.  We are
going in the same direction here.

Suppose there were an alternative syntax for regexes that did *not*
require that everything be compressed into a single string?  Rather
than trying to pack all of Perl into the regex syntax, bit by bit,
using ever longer and more bizarre punctuation sequences, I think a
better solution would be to try to expose the parts of the regex
engine that we are trying to control.

I have some ideas about how to do this, and I will try to write up an
RFC this week.
