----- Original Message -----
From: "Richard Proctor" <[EMAIL PROTECTED]>
Sent: Tuesday, September 05, 2000 1:49 PM
Subject: Re: RFC 145 (alternate approach)


> On Tue 05 Sep, David Corbin wrote:
> > Nathan Wiger wrote:
> > > But, how about a new ?m operator?
> > >    /(?m<<|[).*?(?M>>|])/;
> There already is a (?m
> Current Use in perl5
> (?# comment
> (?imsx flags
> (?-imsx flags
> (?: subexpression without bracket capture
> (?= zero-width positive look ahead
> (?! zero width negative look ahead
> (?<= zero-width positive look behind
> (?<! zero width negative look behind
> (?{code} Execute code
> (??{code} Execute code and use result as pattern
> (?> Independent subexpression
> (?(condition)yes-pattern
> (?(condition)yes-pattern|no-pattern
>
> Suggested in RFCs either current or in development
>
> (?$foo= suggested for assignment (RFC 112)
> (?%foo= suggested for hash assignment (RFC 150?)
>
> (?@foo suggested list expansion (?:$foo[0] | $foo[1] | ...) ? (RFC 166)
> (?Q@foo) Quote each item of lists (RFC 166)
> (?^pattern) matches anything that does not match pattern
> (RFC 166 but will be somewhere else on next rewrite [1])
> (?F Failure tokens (RFC in development by me [1])
> (?r),(?f) Suggested in Direction Control RFC 1
> (?& Boolean regexes (RFC in development [1])
> (?*{code}) Execute code with pass/fail result (RFC in development [1])
>
> a,b,c,d,e, ,g,h, ,j,k,l, ,n,o,p,q, , ,t,u,v,w,x,y,z
> A,B,C,D,E, ,G,H,I,J,K,L,M,N,O,P, ,R,S,T,U,V,W,X,Y,Z
> 0,1,2,3,4,5,6,7,8,9
> `_,."+[];'~)

Ok, I've read through some of the archives, and thought this was a good
starting point.
I haven't seen any discussion of an obvious solution (though in another
email I suggested that this approach should be forgone in favor of a
parsing approach.  But one thing at a time).

There are two general problems as I see it.  First, you have to be able to
specify exactly what you're matching.  Obviously, generically matching "[<(`"
etc. is going to be upset if your nesting contains simple things like
" a < 5 " or "I'm going home, it's hot".  A design goal, therefore, should
be to explicitly state the matching characters.  Second, you need to be able
to apply additional expression syntax to match inside the nesting.

An additional problem occurs when you suggest using pragmas to specify
delimiters.  It could be a performance hit, if not a developer's nightmare.
When I run eval, must I always set the pragma, just in case there is some
weird scoping problem?  It's the same problem as using global variables
everywhere (and the 'local' keyword.  God I hate that thing).

Therefore, I suggest a commonly used form:

/(?N [ { ] ......... )/x

Note that I use N, which stands for nesting, instead of the redundant
'M'atch.  I don't know how well character-based op-codes will be accepted.
As pointed out above, the symbol-space is shrinking fast.

The dots describe further matching / capturing within the delimiters.  Thus
/A (?N [ { ] ) B/x
will match 'A' followed by a bracket grouping (anything therein is fine),
then followed by 'B'.

/A (?N [ { ] ( .* ) ) B/x
does the same as above, but captures the internal contents (excluding the
delimiters).

/A ( (?N [ { ]  ) ) B/x
will capture all the contents, including the delimiters.
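
For comparison, here is a rough sketch of how the three forms above could be
emulated without any new syntax.  It assumes a Perl with named captures and
recursive subpatterns (5.10 or later, so not something available when this
was written).  The names "nest", "inner" and match_nested are made up for the
illustration; this is a stand-in for the proposed (?N [ { ] ...), not an
implementation of it.

```perl
use 5.010;
use strict;
use warnings;

# Emulates /A (?N [ { ] ( .* ) ) B/x for curly braces only.
# (?&nest) re-enters the named group, so braces may nest to any depth.
my $nested = qr/
    A \s*
    (?<nest>                      # whole group, delimiters included
        \{
        (?<inner>                 # contents only, delimiters excluded
            (?: [^{}]++ | (?&nest) )*
        )
        \}
    )
    \s* B
/x;

# Returns (group with delimiters, contents without) or an empty list.
sub match_nested {
    my ($text) = @_;
    return unless $text =~ $nested;
    return ($+{nest}, $+{inner});
}

my ($whole, $inner) = match_nested('A {x {y} z} B');
print "whole: $whole\ninner: $inner\n";
```

The point of the comparison is that the delimiter has to be spelled three
times here (open, close, and the negated class), which is exactly the noise
the (?N ...) form would hide.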

/A (?N [ [ ( ]  ( .* )  ) B/x
Same as before, but with squares and parentheses.  Note that delimiter
specifiers can obey the same rules as normal character classes, thus
[ [ ( { < ] means collect the entire group.  POSIX classes can be used for
all of them, as in [=open_braces=] (I don't care what the phrase actually
is).  The reason I chose this is because we are essentially doing a
character class, so we might as well explicitly use one; it makes more
logical sense.  Note that to make emacs happy, you should be able to escape
all the one-way delimiters, as in [ \[ \( \{ \< ].  That might also make it
easier to read, explicitly showing that these are being treated as
characters, and not as actual operators.
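
To show what "a character class of delimiters" has to mean in practice, here
is a hedged sketch for the [ [ ( ] case, again assuming a later Perl with
recursive subpatterns.  Each opener gets its own alternative so that it is
paired with its own closer; the names "group" and first_group are invented
for the example.

```perl
use 5.010;
use strict;
use warnings;

# Rough equivalent of (?N [ [ ( ] ): square brackets and parentheses,
# each opener paired only with its own closer, nested to any depth.
my $group = qr/
    (?<group>
        \[ (?: [^\[\]()]++ | (?&group) )* \]
      | \( (?: [^\[\]()]++ | (?&group) )* \)
    )
/x;

# Returns the first balanced group in $text, or undef if there is none.
sub first_group {
    my ($text) = @_;
    return $text =~ $group ? $+{group} : undef;
}

print first_group('f(a[1], (b))'), "\n";
```

Written out by hand, every extra delimiter pair adds another alternative
and another entry in the negated class, so the character-class shorthand
would save real typing.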

As for special operations such as (/* ... */ ), I would recommend the
use of named character classes: [=c_comment=], for example.  I'm not sure
how those classes are defined, but this obviously requires the system to be
extensible (RFC anyone?).  Of course, this runs into my own objection to
using pragmas to alter the operation of regexes.  Most likely only built-in
types should work.

Another feature could be to treat the matching close brace as an
end-of-line.  Thus the above .* will properly exit.  If this turns out
not to work, then .* can simply be replaced by .*?.  The advantage of this
is in nested expressions, as in:

$r_kw = qr/Keyword \s* .* /x;
$r_lisp_expr = qr/ (?N [ ( ] $r_kw ) /x;
$line = <>;
$line =~ $r_lisp_expr;

But this would also have worked with:
$r_kw = qr/Keyword \s* .* $/x;
Since '$' would treat ')' as '\n'.
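
A sketch of how the same effect can be had in two steps, assuming a later
Perl with recursive subpatterns: first find the balanced group, then run the
sub-expression against the interior only, so that '$' naturally sees the
closing paren as the end of the text.  The sub name keyword_args and the
group names are hypothetical, and the capture around .* is added only so
there is something to return.

```perl
use 5.010;
use strict;
use warnings;

# Inner expression: '$' is the ordinary end-of-string anchor here.
my $r_kw = qr/Keyword \s* (.*) $/x;

# Outer expression: one balanced parenthesised group, interior in "inner".
my $r_parens = qr/
    (?<nest>
        \(
        (?<inner> (?: [^()]++ | (?&nest) )* )
        \)
    )
/x;

# Match the nesting first, then apply $r_kw to what was inside it.
sub keyword_args {
    my ($line) = @_;
    return unless $line =~ $r_parens;
    my $inner = $+{inner};
    return unless $inner =~ $r_kw;
    return $1;
}

print keyword_args('(Keyword foo (bar)) trailing'), "\n";
```

Because the inner match only ever sees the interior string, a greedy .*
cannot run past the closing paren, which is the behaviour the proposal asks
the engine to provide in a single pass.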

The main advantages of this approach are:
    That you can still pre-compile an expression and guarantee that it won't
need recompiling, and that it'll always act the same.
    That you can nest the puppies with complete lack of ambiguity, and
little possibility for syntactical error through variation; a missing leading
[xxx] is a compile error just like any other zero-length operator.
    The net effect is a zero-length operation that simply treats the
matching close brace as an end-of-line.

My theory on how this could be implemented (only knowing as much about the
engine as I've read in books) is that the outer expression would be
recursively matched first.  Then the sub-expression would be matched against
the inside.  This is almost identical to how nested parentheses currently
work.  The effect of this would be to make it less scary to the casual
reader.  Also, it avoids having to pre-compile the parens matcher
separately.  It would simply be built in.  Additionally, by allowing various
text inside, you might be able to produce much more complex statements, such
as

/ (?N [ ( ] \(* \w+ ) /x

In a lisp-type environment, this might mean that any opening parens must
immediately be followed by a word.  By itself, this might not be too useful,
though you might apply some sort of inserted code to operate on the word.
However, this quickly returns to the undesirable state of inside-out
programming.

As for XML, I have to admit that this isn't the best solution.  The closest
I can get is:
/ (?N [=xml=] ) /x
where xml is defined as <(\w+)[^<>]*> with a closing matching delimiter of
</\1>.  The only way this could be useful is if global variables are set,
$^matching_delim, for example. (I'm not really for more single-character
global variables).

Perhaps a perversion could be used:
/ (?N (<(\w+)[^<>]*>) (</\1>)  ) /x

Where the format is now:
(?N (header) (footer) optional sub-matching-text )

The obvious problem with this is that "\1" either has to be nested (ignoring
all outside capture numbers), or the developer has to keep track of which
match they've used.  This works if you break it out:

$nesting = qr/ (?N (<(\w+)[^<>]*>) (</\1>) ) /x;
$match = qr/ ( $this ) ( $that ) ( $nesting ) /x;

More importantly, xml isn't as uniform as simple nested parens.  <!--
... --> might contain broken HTML, and the <! would have to act as a quoter.
I'm not sure but I think you can quote braces too, as in <tag attr="<hi
there>">.  I know the accepted way is to perform <tag attr="&lt;hi
there&gt;">, but whatever.  The reality is that you're dealing with parser
problems at this level of complexity.  Do we really want to build an entire
xml processor inside of perl?
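
To make the "parser problem" concrete, here is a minimal sketch of the
tag-pairing step done with an explicit stack instead of inside the regex.
The sub name first_element is hypothetical, and the tokenizer deliberately
ignores comments, self-closing tags and a quoted '<' inside attributes,
which are exactly the cases that push this towards a real parser.

```perl
use strict;
use warnings;

# Return the text from the first opening tag through its matching close
# tag, or undef when the tags do not balance.
sub first_element {
    my ($text) = @_;
    my (@stack, $start);
    while ($text =~ m{<(/?)(\w+)[^<>]*>}g) {
        my ($closing, $name) = ($1, $2);
        if (!$closing) {
            # Remember where the outermost element begins.
            $start = $-[0] unless @stack;
            push @stack, $name;
        }
        else {
            # A close tag must pair with the most recent open tag.
            return undef unless @stack && $stack[-1] eq $name;
            pop @stack;
            return substr($text, $start, $+[0] - $start) unless @stack;
        }
    }
    return undef;    # ran out of input with tags still open
}

print first_element('x <a href="q"><b>hi</b> there</a> y'), "\n";
```

Even this toy needs state (the stack) that a header/footer pair in a regex
has to smuggle through backreferences, which is where the numbering trouble
described above comes from.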

In short, I don't like a reg-ex doing too much with xml.

HOWEVER, there is one important argument for it.  XML _might_ become the
de facto meta-data standard.  In this case, xml extractors and reporters will
become increasingly important.  If perl does not adapt, then it will lose
its importance in this niche.  When I think of data-manipulation, my first
thought is always perl.  Right now, however, when people think of XML, they
think of the many c and java libraries, or the big-name xml databases.
Though perl isn't a database, xml is a glue meta-data, which philosophically
should be intimately compatible with perl.

Comments welcome.

-Michael

