Quite a bit of progress since I sent this. Next piece is conditional patterns, which might take me a little longer to add because I somehow missed them when I made my first pass through Mastering Regular Expressions. This is going to be all new code as a result, and the Java version doesn't support this, so it will be a little more difficult for me to compare results to make sure I'm doing things correctly.
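For anyone who hasn't run into conditional patterns: the general form is (?(group)yes|no), which tries the "yes" branch if the referenced group took part in the match and the "no" branch otherwise. Below is a minimal sketch in Python, used only because its `re` module happens to support this syntax. It is not the ooRexx incubator code, and the pattern and the `bare_address` helper are made up for the illustration.

```python
import re

# Group 1 optionally captures an opening "<".
# The conditional (?(1)>) requires a closing ">" only if group 1 matched.
ADDR = re.compile(r"^(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)$")


def bare_address(text):
    """Return the address without brackets, or None if the brackets are unbalanced."""
    m = ADDR.match(text)
    return m.group(2) if m else None


if __name__ == "__main__":
    for candidate in ("<user@example.com>", "user@example.com", "<user@example.com"):
        print(candidate, "->", bare_address(candidate))
```

The point of the construct is that a single expression can accept both the bracketed and the bare form while still rejecting a dangling bracket, which would otherwise need two alternatives that repeat the address pattern.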
Reusable patterns are something new I cooked up, and I'm surprised that nobody seems to have done something like this already. In reading up on regular expressions, I'm amazed that books like Regular Expressions Cookbook exist, where users have to type in or cut-and-paste regular expressions that are multiple lines long. Just examine the patterns for matching URLs in the Regular Expressions Cookbook, for example. Those are nuts! There should be an easier way to use that sort of expertise without having to deal with long series of strange-looking characters.

There's already a flavor of this with named class patterns such as \p{Lower}. I've extended this in the ooRexx version by allowing additional named families to be added to a regex compiler instance. So, my thinking here is to add a similar syntax to allow reusable named patterns to be referenced in a regular expression. There are only a few letters available for escaped operations where neither the lowercase nor the uppercase letter is already used by the various flavors. I'm tentatively considering using \m and \M (for the NOT version). Another option would be to overload \p and \P by using a different delimiter for the name. <name> is used for named groups, so \p<url> would be a compatible extension to what other regex dialects use. So, for example, to validate a line containing just a URL, you could use the regular expression ^\m{url}$ vs. something like this:
^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~/|/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$

It would be really useful to be able to extract information from the saved patterns, using named groups to get pieces of information from a match. For example, a URL pattern might allow you to extract the protocol, domain, port, etc. pieces of the URL by name after a successful match. My current thinking is to make this information available from the group that contains the pattern reference, using the same name as the compiled pattern. So, to get the host information from a match, you would use something like this:

    r = p~find(line)
    if r~matched then do
       -- group(0) is the main matching group, get the match information from the
       -- url reference and extract its "host" group
       say "The target host is" r~group(0)~pattern("url")~group("host")~text
    end

I envision having the base compiler support a wide list of common matching patterns, and it will also be possible to add additional pattern types to the set of callable patterns.

Anyway, once I have conditionals done and have implemented the comment node (which the example above shows I'm also missing), I'll start playing with this capability.

Rick

On Wed, May 5, 2010 at 12:20 PM, Rick McGuire <object.r...@gmail.com> wrote:
> The regular expression incubator project is moving along at a fairly
> good pace. Most of the basics are now implemented and have unit
> tests, so many of the standard expression types should be working now.
> Stuff I have yet to finish are:
>
> 1) Unit tests for "lookarounds"   Done
> 2) Non-capturing groups   Done
> 3) Atomic non-capturing groups   Done
> 4) The various option flags, both on the compiler instance and as
> flags in the expressions.   Done
> 5) Conditional patterns
> 6) Reusable patterns
> 7) Tests for the parser class
> 8) Tests for the split method
> 9) Support for \Q \E qualifiers during parsing.   Done
>
>
> This is mostly a todo list to remind me of what I still need to do.
> I'd love it if people would start trying this out, or even better, start
> writing test cases for this. It would be great to take something
> like the Regular Expressions Cookbook and test out the different
> expressions there in a test group. Once I'm reasonably comfortable
> with the quality of this, I'll start considering doing something like
> building regex support into the parse instruction or using regex
> filters on the new File class in 4.1 (hint, hint). However, if there
> really doesn't appear to be any interest in this, it will likely stay
> in the incubator.
>
> Rick
>

------------------------------------------------------------------------------
_______________________________________________
Oorexx-devel mailing list
Oorexx-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/oorexx-devel