Re: [Oorexx-devel] Regular expression progress

Gil Barmwater Sat, 08 May 2010 09:19:52 -0700

Wow!  I saw the first post and Mark's reply and meant to respond but... 
So now here goes.  Let me start by saying I had no idea this would be as 
extensive as it has turned out.  You are now so far beyond my level of 
understanding that, at best, all I can possibly contribute (code-wise) 
is flagging any typos I spot or minor rewrites for readability.  And 
since I don't understand the functionality, writing test cases also 
seems out of the question.  But, I might be able to do some work on the 
.Parser class, namely taking the test cases for the Parse instruction 
(the basic/named ones) and rewriting them for that class.  Let me know 
if that is something I should spend time on before I start up that 
learning curve :-)


Now some questions about the design so far.  It appears that you have 
started with the Java regex design and are now extending it to include 
features NOT supported by Java.  Not knowing enough about this whole 
subject, can you tell me how your design compares with the design 
specified by the ECMAScript standard?  I.e. is it a (sub|super)set or 
are there mutually exclusive parts in the two versions?

Next I noted from the examples you posted that the groups collection 
seems to be 0-based - groups(0) - which is consistent with Java but not 
with Rexx.  Would you consider making it 1-based and how difficult would 
that be?
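
To make the numbering question concrete, here is a small sketch in Python, whose re module follows the same convention as Java: index 0 is the whole match and the capturing groups start at 1.  The one_based helper is purely hypothetical, shown only to suggest how thin a renumbering layer could be; it is not taken from the ooRexx code.

```python
import re

m = re.search(r"(\w+)@(\w+)\.com", "mail gil@example.com now")

whole = m.group(0)   # index 0 is the entire match
first = m.group(1)   # capturing groups start at 1

def one_based(match):
    """Hypothetical 1-based view: item 1 is the whole match,
    item 2 is the first capturing group, and so on."""
    return {i + 1: match.group(i) for i in range(match.re.groups + 1)}

shifted = one_based(m)
```

With a shift like this every existing capture index moves up by one, which is the main compatibility cost to weigh against the Rexx convention.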

Think that's all for now.

Rick McGuire wrote:
> Quite a bit of progress since I sent this.  Next piece is conditional
> patterns, which might take me a little longer to add because I somehow
> missed them when I made my first pass through Mastering Regular
> Expressions.  This is going to be all new code as a result, and the
> Java version doesn't support this, so it will be a little more
> difficult for me to compare results to make sure I'm doing things
> correctly.
> 
> Reusable patterns are something new I cooked up, and I'm very
> surprised that nobody has ever done something like this.  In reading up
> on regular expressions, I'm amazed that users are expected to type in
> or cut-and-paste multi-line regular expressions from books like the
> Regular Expressions Cookbook.  Just examine the patterns for matching
> URLs in the Regular Expressions Cookbook, for example.
> Those are nuts!  There should be an easier way to use that sort of
> expertise without having to deal with long series of strange looking
> characters.  There's already a flavor of this with named class
> patterns such as \p{Lower}.  I've extended this in the ooRexx
> version by allowing additional named families to be added to a regex
> compiler instance.
> 
> So, my thinking here is to add a similar syntax to allow reusable
> named patterns to be referenced in a regular expression.  There are
> only a few letters available for escaped operations where neither the
> lowercase nor the uppercase letter is used in the various flavors.  I'm
> tentatively considering using \m and \M (for the NOT version).
> Another option would be to overload \p and \P by using a different
> delimiter for the name.  <name> is used for named groups, so \p<url>
> would be a compatible extension to what other regex dialects use.
> 
> So, for example, to validate a line containing just a URL, you could
> use the following regular expression:
> 
> ^\m{url}$
> 
> vs. something like this.
> 
> ^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~/|/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
> 
> It would be really useful to be able to extract information from the
> saved patterns using named groups to get pieces of information from a
> match.  For example a URL pattern might allow you to extract the
> protocol, domain, port, etc. pieces of the URL by name after a
> successful match.  My current thinking is to make this information
> available from the group that contains the pattern reference using the
> same name as the compiled pattern.  So, to get the host information
> from a match, you would use something like this:
> 
> r = p~find(line)
> if r~matched then do
>    -- group(0) is the main matching group, get the match information from the
>    -- url reference and extract its "host" group
>    say "The target host is" r~group(0)~pattern("url")~group("host")~text
> end
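
The reusable-pattern idea can be sketched in Python as a rough illustration, not as the ooRexx design.  Everything here is an assumption made for the example: the PATTERNS registry, the expand helper, and the much simplified "url" pattern (it is not the cookbook pattern quoted above).  The sketch replaces each \m{name} reference with the stored pattern before compiling, then pulls pieces out of the match by group name.

```python
import re

# Hypothetical registry of reusable patterns.  This simplified "url"
# entry is an assumption for illustration only.
PATTERNS = {
    "url": r"(?P<protocol>https?|ftp)://(?P<host>[-\w.]+)"
           r"(?::(?P<port>\d{1,5}))?(?P<path>/\S*)?",
}

# Matches a reference such as \m{url} and captures the pattern name.
REFERENCE = re.compile(r"\\m\{(\w+)\}")

def expand(regex, patterns=PATTERNS):
    """Replace each \\m{name} reference with the stored pattern,
    wrapped in a non-capturing group."""
    def substitute(ref):
        return "(?:" + patterns[ref.group(1)] + ")"
    return REFERENCE.sub(substitute, regex)

compiled = re.compile(expand(r"^\m{url}$"))
m = compiled.match("http://www.oorexx.org:8080/docs")
host = m.group("host") if m else None
port = m.group("port") if m else None
```

Plain textual expansion like this has an obvious limit: referencing the same pattern twice in one expression duplicates the group names, which Python rejects at compile time.  Scoping the names under the pattern reference, as in r~group(0)~pattern("url")~group("host") above, avoids that collision.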
> 
> I envision the base compiler supporting a wide range of common
> matching patterns, and it will also be possible to add additional
> pattern types to the set of callable patterns.
> 
> Anyway, once I have conditionals done and have implemented the comment
> node (which, as the example above shows, I'm also missing), I'll start
> playing with this capability.
> 
> Rick
> 
> On Wed, May 5, 2010 at 12:20 PM, Rick McGuire <object.r...@gmail.com> wrote:
> 
>>The regular expression incubator project is moving along at a fairly
>>good pace.  Most of the basics are now implemented and have unit
>>tests, so many of the standard expression types should be working now.
>> Things I have yet to finish are:
>>
>>1)  Unit tests for "lookarounds"
> 
> 
> Done
> 
> 
>>2)  Non-capturing groups
> 
> 
> Done
> 
> 
>>3)  Atomic non-capturing groups
> 
> 
> Done
> 
> 
>>4)  The various option flags, both on the compiler instance and as
>>flags in the expressions.
> 
> 
> Done
> 
> 
>>5)  Conditional patterns
>>6)  Reusable patterns
>>7)  Tests for the parser class
>>8)  Tests for the split method
>>9)  Support for \Q \E qualifiers during parsing.
> 
> 
> Done
> 
>>
>>This is mostly a todo list to remind me of what I still need to do.
>>I'd love it if people would start trying this out, or even better, start
>>writing test cases for this.  It would be great to take something
>>like the Regular Expressions Cookbook and test out the different
>>expressions there in a test group.  Once I'm reasonably comfortable
>>with the quality of this, I'll start considering doing something like
>>building regex support into the parse instruction or using regex
>>filters on the new File class in 4.1 (hint, hint).  However, if there
>>really doesn't appear to be any interest in this, it will likely stay
>>in the incubator.
>>
>>Rick
>>
> 
> 
> ------------------------------------------------------------------------------
> 
> _______________________________________________
> Oorexx-devel mailing list
> Oorexx-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/oorexx-devel
> 

-- 
Gil Barmwater
