Re: [Oorexx-devel] Regular expression progress

Rick McGuire Sat, 08 May 2010 08:14:35 -0700

Quite a bit of progress since I sent this.  Next piece is conditional
patterns, which might take me a little longer to add because I somehow
missed them when I made my first pass through Mastering Regular
Expressions.  This is going to be all new code as a result, and the
Java version doesn't support this, so it will be a little more
difficult for me to compare results to make sure I'm doing things
correctly.

The reusable patterns feature is something new I cooked up, and I'm
amazed that, as far as I can tell, nobody has done something like this
before.  In reading up on regular expressions, I was surprised that
books like the Regular Expressions Cookbook exist, and that users have
to type in or cut-and-paste regular expressions that are multiple lines
long.  Just look at the patterns for matching URLs in the Regular
Expressions Cookbook, for example.  Those are nuts!  There should be an
easier way to use that sort of expertise without having to deal with
long series of strange-looking characters.  There's already a flavor
of this with named class patterns such as \p{Lower}.  I've extended
this in the ooRexx version by allowing additional named families to be
added to a regex compiler instance.

So, my thinking here is to add a similar syntax to allow reusable
named patterns to be referenced in a regular expression.  There are
only a few letters available for escaped operations where neither the
lowercase nor the uppercase letter is already used by one of the
various flavors.  I'm tentatively considering using \m and \M (for the
NOT version).
Another option would be to overload \p and \P by using a different
delimiter for the name.  <name> is used for named groups, so \p<url>
would be a compatible extension to what other regex dialects use.

So, for example, to validate a line containing just a URL, you could
use the following regular expression:

^\m{url}$

vs. something like this:

^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~/|/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
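As a rough illustration of the idea only (this is not the ooRexx
implementation), the following Python sketch expands \m{name}
references by textual substitution before handing the pattern to
Python's re module.  The registry, the helper names, and the much
simplified URL pattern are all assumptions made for the example.

```python
import re

# Hypothetical registry of reusable patterns.  The names and the
# simplified URL pattern are assumptions made for this sketch.
NAMED_PATTERNS = {
    "url": r"(?:https?|ftp)://[-\w.]+(?::\d{1,5})?(?:/[^\s?#]*)?(?:\?[^\s#]*)?(?:#\S*)?",
    "digits": r"\d+",
}

# Matches a reference of the form \m{name}.
_REFERENCE = re.compile(r"\\m\{(\w+)\}")


def expand(pattern):
    """Replace each \\m{name} reference with its registered pattern."""
    def substitute(match):
        name = match.group(1)
        if name not in NAMED_PATTERNS:
            raise KeyError(f"unknown named pattern: {name}")
        # Wrap in a non-capturing group so that anchors and quantifiers
        # around the reference apply to the whole referenced pattern.
        return "(?:" + NAMED_PATTERNS[name] + ")"
    return _REFERENCE.sub(substitute, pattern)


def compile_with_names(pattern, flags=0):
    """Compile a pattern after expanding its \\m{name} references."""
    return re.compile(expand(pattern), flags)


url_line = compile_with_names(r"^\m{url}$")
print(bool(url_line.match("http://www.oorexx.org/docs/")))
print(bool(url_line.match("not a url")))
```

Plain substitution like this is naive: it would also rewrite an
escaped backslash followed by m{...}, and it does nothing for the \M
(NOT) form.  A real compiler would presumably resolve the reference
while parsing rather than by editing the pattern text.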

It would be really useful to be able to extract information from the
saved patterns using named groups to get pieces of information from a
match.  For example a URL pattern might allow you to extract the
protocol, domain, port, etc. pieces of the URL by name after a
successful match.  My current thinking is to make this information
available from the group that contains the pattern reference using the
same name as the compiled pattern.  So, to get the host information
from a match, you would use something like this:

r = p~find(line)
if r~matched then do
   -- group(0) is the main matching group, get the match information from the
   -- url reference and extract its "host" group
   say "The target host is" r~group(0)~pattern("url")~group("host")~text
end
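To make the lookup-by-name idea concrete, here is a small Python
sketch of one way it could work; it is an assumption-laden
illustration, not the ooRexx design.  The referenced pattern is
embedded in a group named after the pattern, and its own named groups
are recovered by re-matching the captured text against the standalone
pattern.  The PATTERNS table, the embed and sub_groups helpers, and
the simplified URL pattern are all hypothetical.

```python
import re

# Hypothetical reusable pattern with named sub-groups (simplified URL).
PATTERNS = {
    "url": r"(?P<protocol>https?|ftp)://(?P<host>[-\w.]+)"
           r"(?::(?P<port>\d{1,5}))?(?P<path>/\S*)?",
}


def embed(name):
    """Wrap a registered pattern in a group named after the pattern.

    The inner group names are stripped so that the same pattern can be
    referenced more than once without duplicate-name errors.
    """
    anonymous = re.sub(r"\(\?P<\w+>", "(?:", PATTERNS[name])
    return f"(?P<{name}>{anonymous})"


def sub_groups(match, name):
    """Re-match the text captured for `name` to recover its named pieces."""
    inner = re.fullmatch(PATTERNS[name], match.group(name))
    return inner.groupdict()


line_pattern = re.compile(r"^fetch\s+" + embed("url") + "$")
m = line_pattern.match("fetch http://www.oorexx.org:8080/docs/index.html")
if m:
    print("The target host is", sub_groups(m, "url")["host"])
```

Re-matching keeps the sketch short, but it assumes the standalone
pattern splits the captured text the same way the embedded copy did.
A regex engine that tracks the nested groups during the original match
would not need that assumption.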

I envision the base compiler supporting a wide list of common
matching patterns, and it will also be possible to add additional
pattern types to the set of callable patterns.

Anyway, once I have conditionals done and have implemented the comment
node (the example above shows I'm missing that too), I'll start
playing with this capability.

Rick

On Wed, May 5, 2010 at 12:20 PM, Rick McGuire <object.r...@gmail.com> wrote:
> The regular expression incubator project is moving along at a fairly
> good pace.  Most of the basics are now implemented and have unit
> tests, so many of the standard expression types should be working now.
> The things I have yet to finish are:
>
> 1)  Unit tests for "lookarounds"

Done

> 2)  Non-capturing groups

Done

> 3)  Atomic non-capturing groups

Done

> 4)  The various option flags, both on the compiler instance and as
> flags in the expressions.

Done

> 5)  Conditional patterns
> 6)  Reusable patterns
> 7)  Tests for the parser class
> 8)  Tests for the split method
> 9)  Support for \Q \E qualifiers during parsing.

Done
>
>
> This is mostly a todo list to remind me of what I still need to do.
> I'd love it if people would start trying this out, or even better,
> start writing test cases for this.  It would be great to take something
> like the Regular Expressions Cookbook and test out the different
> expressions there in a test group.  Once I'm reasonably comfortable
> with the quality of this, I'll start considering doing something like
> building regex support into the parse instruction or using regex
> filters on the new File class in 4.1 (hint, hint).  However, if there
> really doesn't appear to be any interest in this, it will likely stay
> in the incubator.
>
> Rick
>

_______________________________________________
Oorexx-devel mailing list
Oorexx-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oorexx-devel