Re: [Oorexx-devel] Regular expression progress

Rony G. Flatscher Sat, 08 May 2010 09:37:33 -0700

Indeed, your progress has been great, judging by the commits and by the
test cases and how they compare to the Java specs (and now you are even
going beyond Java by adding useful features that make the regex package
more powerful).


There is one particular feature which I think will make using regular
expressions extremely easy in many cases (and for many users), and that
is your named patterns! Ever since I first heard you talk about them
back in Tampa I have been fascinated by them and had hoped that they
would eventually get implemented (and also that some easy-to-use syntax
for employing them within parse would eventually be formulated).
(Otherwise I would have thought that the package would not be necessary,
as one could use the Java regex directly.)

---

One caveat though: regular expressions are very powerful, but hard to
learn and hard to remember if one does not use them for a longer time,
because a lot of Perlish (PCRE) style shines through: cryptic letters
that mean something specific only at certain positions. For people who
are beginners in learning a programming language, "regular" regular
expressions are almost unbearable. So one idea would be to come up with
some "Rexxish" notation for the regex in addition to the "pure" one
(the pure one remains a boon for the regex experts). E.g. an encoding
for the regex patterns that makes it instantly clear whether a pattern
is greedy, lazy/reluctant or possessive, or that spells out all the
non-capturing lookaround "(?" patterns and the like. The Rexxish
version should be as self-describing as possible (such that one does
not need to look up the regex documentation to learn what an expression
means in effect).
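
To make this a bit more concrete, here is a minimal sketch of what such a
self-describing notation could look like. It is written in Python rather
than ooRexx, only because Python's re module is at hand for trying it
out; the names MODE_SUFFIX, repeat(), followed_by() and not_followed_by()
are made up for illustration and are not part of the regex package:

```python
import re

# Hypothetical readable names for the quantifier modes, mapped to the
# suffix that standard regex syntax uses for each mode.
MODE_SUFFIX = {"greedy": "", "lazy": "?", "possessive": "+"}

def repeat(atom, at_least=0, at_most=None, mode="greedy"):
    """Build regex source that repeats `atom`, naming the mode in words."""
    if mode not in MODE_SUFFIX:
        raise ValueError("unknown mode: " + mode)
    upper = "" if at_most is None else str(at_most)
    return "(?:%s){%d,%s}%s" % (atom, at_least, upper, MODE_SUFFIX[mode])

def followed_by(source):
    """Positive lookahead, spelled out instead of '(?=...)'."""
    return "(?=" + source + ")"

def not_followed_by(source):
    """Negative lookahead, spelled out instead of '(?!...)'."""
    return "(?!" + source + ")"

text = "<a><b>"
greedy = re.match("<" + repeat(".", at_least=1, mode="greedy") + ">", text)
lazy = re.match("<" + repeat(".", at_least=1, mode="lazy") + ">", text)
print(greedy.group())   # the greedy form runs to the last ">"
print(lazy.group())     # the lazy form stops at the first ">"
```

The possessive suffix is generated but not exercised here, since older
Python versions reject possessive quantifiers. The point is only that the
mode is spelled out in a word instead of a trailing "?" or "+".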

The goal would be to make the power of your implementation easily
tangible for Rexx programmers who have never dealt with regex. And of
course, any ideas that help to define a Rexxish interface to this
powerful regex package would be welcome.

So the idea would be that eventually there would be an "expert regex"
and a "Rexx regex" which is fully based on the "expert regex" package.

---rony

P.S.: Another feature that would be nice is being able to redefine
predefined character classes (and maybe named patterns), like being able
to redefine "\w" and the like (this would allow using "\w" to match,
e.g., a Rexx word or a German word). That is partly possible in the vim
syntax highlighting definition files (which are based on regex, but
where one is able to redefine the fundamental regex).
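
A minimal sketch of the redefinition idea, again in Python and not in
terms of the actual package: the pattern source is rewritten before it is
compiled, so that "\w" stands for a caller-supplied set of characters.
The function name redefine_classes() and the REXX_SYMBOL character set
are assumptions made up for this illustration:

```python
import re

def redefine_classes(pattern, overrides):
    """Rewrite escapes such as \\w using caller-supplied class bodies.

    `overrides` maps the escape letter to the body of a character class
    (without brackets). Simplification: a literal "]" placed first
    inside a bracket expression is not recognized.
    """
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt in overrides:
                body = overrides[nxt]
                # Inside [...] only the body is spliced in; outside, the
                # body is wrapped in its own brackets.
                out.append(body if in_class else "[" + body + "]")
            else:
                out.append(ch + nxt)   # keep any other escape untouched
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        out.append(ch)
        i += 1
    return "".join(out)

# Assumed, illustrative set of characters making up a Rexx symbol.
REXX_SYMBOL = {"w": "A-Za-z0-9_.!?"}

source = redefine_classes(r"\w+", REXX_SYMBOL)
print(source)
print(re.findall(source, "say stem.item! now"))
```

With the override in place, "stem.item!" is picked up as a single word,
which the predefined "\w" would split at the period.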




On 08.05.2010 17:14, Rick McGuire wrote:
> Quite a bit of progress since I sent this.  Next piece is conditional
> patterns, which might take me a little longer to add because I somehow
> missed them when I made my first pass through Mastering Regular
> Expressions.  This is going to be all new code as a result, and the
> Java version doesn't support this, so it will be a little more
> difficult for me to compare results to make sure I'm doing things
> correctly.
>
> The reusable patterns are something new I cooked up, and I'm very
> amazed that nobody has ever done something like this.  In reading up
> on regular expressions, I'm amazed that books like Regular Expressions
> Cookbook exist, from which users have to type in or cut-and-paste
> regular expressions that are multiple lines long.  Just examine the
> patterns for matching URLs in the Regular Expressions Cookbook, for
> example.  Those are nuts!  There should be an easier way to use that
> sort of expertise without having to deal with long series of
> strange-looking characters.  There's already a flavor of this with
> named class patterns such as \p{Lower}.  I've extended this in the
> ooRexx version by allowing additional named families to be added to a
> regex compiler instance.
>
> So, my thinking here is to add a similar syntax to allow reusable
> named patterns to be referenced in a regular expression.  There are
> only a few letters available for escaped operations where both the
> lowercase and uppercase letters are not used in various flavors.  I'm
> tentatively considering using \m and \M (for the NOT version).
> Another option would be to overload \p and \P by using a different
> delimiter for the name.  <name> is used for named groups, so \p<url>
> would be a compatible extension to what other regex dialects use.
>
> So, for example, to validate a line containing just a URL, you could
> use the following regex expression:
>
> ^\m{url}$
>
> vs. something like this.
>
> ^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~/|/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
>
> It would be really useful to be able to extract information from the
> saved patterns using named groups to get pieces of information from a
> match.  For example a URL pattern might allow you to extract the
> protocol, domain, port, etc. pieces of the URL by name after a
> successful match.  My current thinking is to make this information
> available from the group that contains the pattern reference using the
> same name as the compiled pattern.  So, to get the host information
> from a match, you would use something like this:
>
> r = p~find(line)
> if r~matched then do
>    -- group(0) is the main matching group, get the match information from the
>    -- url reference and extract its "host" group
>    say "The target host is" r~group(0)~pattern("url")~group("host")~text
> end
>
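
(An inline aside: the \m{name} reference and the named-group lookup
quoted above can be approximated by expanding the references before
compiling. This is a minimal sketch in Python rather than ooRexx, with a
deliberately simplified URL pattern instead of the long one quoted
above. PATTERNS and expand() are hypothetical names, and since a Python
match object has no pattern("url") step, the "host" group is read
directly.)

```python
import re

# Hypothetical registry of reusable patterns. The URL pattern is a
# simplified stand-in, assumed for illustration only.
PATTERNS = {
    "url": (r"(?P<protocol>https?|ftp)://"
            r"(?P<host>[-\w.]+)"
            r"(?::(?P<port>\d{1,5}))?"
            r"(?P<path>/\S*)?"),
}

def expand(pattern, registry=PATTERNS):
    """Replace each \\m{name} reference with the registered pattern."""
    def repl(m):
        name = m.group(1)
        if name is None:
            return m.group(0)          # some other escape: keep it as is
        return "(?:" + registry[name] + ")"
    # Every escape is consumed as a unit, so an escaped backslash that
    # happens to be followed by "m{...}" is not mistaken for a reference.
    return re.sub(r"\\(?:m\{(\w+)\}|.)", repl, pattern, flags=re.DOTALL)

p = re.compile(expand(r"^\m{url}$"))
r = p.search("http://www.example.org:8080/docs/")
if r:
    print("The target host is", r.group("host"))
```

One limitation of this textual expansion: Python rejects duplicate group
names, so the same named pattern cannot be referenced twice in one
expression. Scoping the groups to the reference, as proposed in the
quoted text, would avoid that.
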
> I envision having the base compiler support a wide list of common
> matching patterns, and it will also be possible to add additional
> pattern types to the set of callable patterns.
>
> Anyway, once I have conditionals done and have implemented the comment
> node (which, as the example above shows, I'm also missing), I'll start
> playing with this capability.
>
> Rick
>
> On Wed, May 5, 2010 at 12:20 PM, Rick McGuire <object.r...@gmail.com> wrote:
>   
>> The regular expression incubator project is moving along at a fairly
>> good pace.  Most of the basics are now implemented and have unit
>> tests, so many of the standard expression types should be working now.
>> Stuff I have yet to finish:
>>
>> 1)  Unit tests for "lookarounds"
>>     
> Done
>
>   
>> 2)  Non-capturing groups
>>     
> Done
>
>   
>> 3)  Atomic non-capturing groups
>>     
> Done
>
>   
>> 4)  The various option flags, both on the compiler instance and as
>> flags in the expressions.
>>     
> Done
>
>   
>> 5)  Conditional patterns
>> 6)  Reusable patterns
>> 7)  Tests for the parser class
>> 8)  Tests for the split method
>> 9)  Support for \Q \E qualifiers during parsing.
>>     
> Done
>   
>>
>> This is mostly a todo list to remind me of what I still need to do.
>> I'd love it if people would start trying this out, or even better,
>> start writing test cases for this.  It would be great to take
>> something like the Regular Expressions Cookbook and test out the
>> different expressions there in a test group.  Once I'm reasonably
>> comfortable with the quality of this, I'll start considering doing
>> something like building regex support into the parse instruction or
>> using regex filters on the new File class in 4.1 (hint, hint).
>> However, if there really doesn't appear to be any interest in this,
>> it will likely stay in the incubator.
>>
>> Rick
>>     


