RFC 274 (v2) Generalised Additions to Regexs

Perl6 RFC Librarian Sun, 01 Oct 2000 17:35:27 -0700
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1  TITLE

Generalised Additions to Regexs

=head1 VERSION

  Maintainer: Richard Proctor <[EMAIL PROTECTED]>
  Date: 22 Sep 2000
  Last Modified: 1 Oct 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 274
  Version: 2
  Status: Frozen

=head1 ABSTRACT

This proposes a way for generalised additions to regex capabilities.

=head1 DESCRIPTION

Given that expansion of regexes could include (+...) and (*...) I have
been thinking about providing a general purpose way of adding
functionality.  Hence I propose that the entire (+...) syntax is
kept free from formal specification for this. (+ = addition)

A module or anything that wants to support some enhanced syntax
registers a callback that handles "regex enhancements".

=head2 Basic operation

There are two ways these could operate: 

=head3 My original:

My thoughts were to leave the syntax completely open so that anything
could be added - words, Chinese, $$$.

At regex compile time, if and when (+foo) is found, perl calls
each of the registered regex enhancements in turn.  These:

1) Are passed the foo string as a parameter exactly as is.  (There is
an issue of actually finding the end of the generic foo.)

2) The regex enhancement can either recognise the content or not.

Does a newer localised definition replace the older one?  The handling of
multiple registrations has to be resolved.  My initial thoughts are that a
"Last registered is checked first" approach may be best.

3) If not, the enhancement returns undef and perl goes to the next regex
enhancement.  (Does it handle the enhancements as a stack (last checked
first) or a list (first checked first)?  How are they scoped?  A job here
for the OO/scoping fanatics.)

4) If perl runs out of registered regex enhancements it reports an error.  
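
The four steps above can be sketched in ordinary Perl 5.  This is only an
illustration of the "last registered is checked first" idea: the names
register_enhancement and resolve_enhancement are hypothetical, nothing like
them exists in perl, and the handlers here return plain pattern text.

```perl
use strict;
use warnings;

# Hypothetical registry; the most recently registered handler comes first.
my @enhancements;

sub register_enhancement {
    my ($handler) = @_;
    unshift @enhancements, $handler;    # last registered, first checked
    return;
}

# In this sketch, called whenever the compiler meets (+foo);
# $content is the text between "(+" and ")" exactly as written.
sub resolve_enhancement {
    my ($content) = @_;
    for my $handler (@enhancements) {
        my $result = $handler->($content);
        return $result if defined $result;    # the handler recognised it
    }
    # Step 4: no registered handler recognised the content.
    die "Unrecognised regex enhancement (+$content)\n";
}

register_enhancement(sub { $_[0] eq 'digits' ? '\d+' : undef });
register_enhancement(sub { $_[0] eq 'word'   ? '\w+' : undef });

print resolve_enhancement('digits'), "\n";    # prints \d+
```

With a stack like this, a later registration for the same content simply
shadows the earlier one, which is one possible answer to the question of
multiple registrations; scoping is not addressed here.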

=head3 Hugo alternative:

I'd be more inclined to have callbacks registered for a word: that
way we can complain earlier when two modules try to register the
same word. Then at regexp-compile time we parse out the word
following the (+ and immediately know who to pass it to (or fail).

Well, there are limits to what we can handle - earlier, the parser
will have had to be able to determine where the end of the regexp
is. Even specifying a word at the beginning doesn't help: we need
to know whether the rest should look like a regexp, or code, or
whatever else. The regexp compiler doesn't get a look in until
after that has been done.

Which suggests that maybe each callback - whether or not we link
them to words - should specify what it will match, which suggests
it should be linked with a regexp.
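
Hugo's word-keyed variant might look like the following sketch.  Again the
function names are hypothetical, and the way the leading word is split from
the rest of the content is an assumption made for the example; it does not
solve the problem of finding the end of the construct.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical table mapping a word to its handler.
my %handler_for;

sub register_enhancement_word {
    my ($word, $handler) = @_;
    # Complain early when two modules try to claim the same word.
    croak "Regex enhancement '$word' is already registered"
        if exists $handler_for{$word};
    $handler_for{$word} = $handler;
    return;
}

# $content is the text that followed "(+" in the pattern.
sub dispatch_enhancement {
    my ($content) = @_;
    my ($word, $rest) = $content =~ /\A(\w+)\s*(.*)\z/s
        or croak "No leading word in (+$content)";
    my $handler = $handler_for{$word}
        or croak "No handler registered for (+$word ...)";
    return $handler->($rest);
}

register_enhancement_word(repeat => sub { my ($arg) = @_; "(?:$arg)+" });

print dispatch_enhancement('repeat ab'), "\n";    # prints (?:ab)+
```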

=head2 Actions by the Callback

If an enhancement recognises the content it could do either of:

a) return a replacement regex, expanded using existing capabilities; perl
will then pass this back through the regex compiler.

The replacement may itself contain (+...) constructs, so expansion can
recurse; the compiler might want to issue a warning if the nesting appears
to go too deep.

b) return a coderef that is called at run time when the regex gets to this
point.  This is akin to (?{...}) or (??{...}), or maybe (?*{...}); see
RFC 198.
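
Case (a) can be illustrated with a small textual expander.  This is a
sketch under several assumptions: the %expansion table and the
expand_enhancements function are hypothetical, the content of (+...) is
assumed to contain no parentheses (which sidesteps the problem of finding
its end), and the depth limit of 10 is arbitrary.  The sketch dies rather
than warns when the limit is passed, so that a self-referential expansion
terminates.

```perl
use strict;
use warnings;

# Hypothetical expansions; "mac" is defined in terms of "hexbyte".
my %expansion = (
    hexbyte => '[0-9A-Fa-f]{2}',
    mac     => '(+hexbyte)(?::(+hexbyte)){5}',
);

my $MAX_DEPTH = 10;    # arbitrary limit chosen for this sketch

sub expand_enhancements {
    my ($pattern, $depth) = @_;
    $depth //= 0;
    die "Regex enhancements nested deeper than $MAX_DEPTH\n"
        if $depth > $MAX_DEPTH;

    # Replace each (+name) and feed the result back through the expander.
    $pattern =~ s{\(\+([^()]*)\)}{
        my $name        = $1;
        my $replacement = $expansion{$name};
        defined $replacement
            or die "Unrecognised regex enhancement (+$name)\n";
        expand_enhancements($replacement, $depth + 1);
    }ge;

    return $pattern;
}

my $re_text = expand_enhancements('\A(+mac)\z');
my $re      = qr/$re_text/;

print "matched\n" if '00:1a:2B:3c:4D:5e' =~ $re;
```

The expanded text contains only existing regex syntax, so it can be handed
to the ordinary regex compiler, here via qr//.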

=head2 Embedded Code

The referenced code needs enough access to the regex internals to be able
to see the current sub-expression, request more characters, read the
relevant flags and see whether matching is greedy.  It may also need a
coderef that is similarly called when the regex is being unwound during
backtracking.  These features would also be of interest to the existing
code inside regexes.

Thinking from that - the last case should be generalised (it is sort of
like my (?*{...}) from RFC 198 or an enhancement to (??{...})).  If so,
cases (a) and (b) are the same, as case (b) is just a matter of returning
(?*{...}) containing the appropriate code.

Access to subexpressions - ok, this can be done.

Visibility of flags - not currently possible.  The code might like to know
that /i is in effect, and it might want to know that /s is in effect; it
probably does not need to know about /o.  This is equally true of the
regex enhancement handler that looks at the (+foo) in the first place.  I
think that these could also be of use to (?{...}) code.

Greediness - maybe not necessary, but I think better visibility of
internals might be beneficial.

Hugo: Hm, I do appreciate the problem - I wasn't too happy when I realised
that embedded qr{} expressions are protected from the flags of their outer
regexp, cos I wanted to specify /i on the outside and have it trickle in to
the rest. It feels like it's going to get real messy, though, and totally
screw the optimiser.

Me: This also needs thinking about, but I can't resolve this at the
moment.
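
The qr{} behaviour Hugo describes can be seen in Perl 5 as it stands.  The
following small demonstration uses only existing syntax; the variable names
are just for the example.

```perl
use strict;
use warnings;

my $inner = qr/abc/;    # compiled without /i

# The outer /i does not reach inside the interpolated qr// object.
my $outer_flag_only = ('xABCx' =~ /x${inner}x/i) ? 1 : 0;

# The flag has to be given where the inner pattern is compiled.
my $inner_i    = qr/abc/i;
my $inner_flag = ('xABCx' =~ /x${inner_i}x/) ? 1 : 0;

print "outer /i only: $outer_flag_only\n";    # prints 0
print "inner /i:      $inner_flag\n";         # prints 1
```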

=head2 Code execution for backtracking

Following on, if (?{...}) etc code is evaluated in a forward match, it would
be a good idea to likewise support some code block that is ignored on a
forward match but is executed when the match is unwound due to backtracking.
Thus (?{ foo })(?\{ bar }) executes foo on the forward case and bar if it
unwinds.  I don't care at the moment what the syntax is - what about the
concepts?  Think, for example, about foo putting something on a stack (eg
the bracket to match [RFC 145]) and bar taking it off.

These might be achieved by complex localisation.  Is localisation enough?
Enough to achieve everything you might want to? Yes: you can always
have a (?{ local $a = new Object }) with a DESTROY method. It may not
necessarily be the cleanest possible way to write everything, though.

So functionality for making this easier might be a good idea.
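
As an illustration of what localisation already gives in Perl 5, the
following is adapted from the example in the perlre documentation.  A value
set with local inside (?{...}) is restored when the regex backtracks past
that point, so the counter reflects only the iterations that survive in the
final match.

```perl
use strict;
use warnings;

our ($cnt, $res);    # package variables, so that local() can be used

my $input = 'a' x 8;
$input =~ m<
    (?{ $cnt = 0 })
    (
        a
        (?{ local $cnt = $cnt + 1; })
    )*
    aaaa
    (?{ $res = $cnt })
>x;

print "$res\n";    # prints 4
```

The star first consumes all eight characters, then gives four back so that
the trailing aaaa can match; each step of backtracking undoes one local
assignment.  There is no hook here that runs arbitrary code on unwinding,
which is the gap this section is about.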

=head1 CHANGES

V2 - Several additions of clarity and discussion as to how the content is
recognised - mainly between myself and Hugo.  Everything here has come
from the discussion on the list.

=head1 MIGRATION

This is a new feature - no compatibility problems.

=head1 IMPLEMENTATION

This has not been looked at in detail, but the description above provides
some views as to how it may operate.

=head1 REFERENCES

RFC 145: Brace-matching for Perl Regular Expressions

RFC 198: Boolean Regexes

This message from Larry:

http://www.mail-archive.com/perl6-language%40perl.org/msg02955.html