Re: [Oorexx-devel] Regular expression progress

Rick McGuire Sat, 08 May 2010 09:44:10 -0700

On Sat, May 8, 2010 at 12:19 PM, Gil Barmwater <gbarmwa...@alum.rpi.edu> wrote:
> Wow!  I saw the first post and Mark's reply and meant to respond but...
> So now here goes.  Let me start by saying I had no idea this would be as
> extensive as it has turned out.  You are now so far beyond my level of
> understanding that, at best, all I can possibly contribute (code-wise)
> is any typos I might spot or minor re-writes for understandability.  And
> since I don't understand the functionality, writing test cases also
> seems out of the question.  But, I might be able to do some work on the
> .Parser class, namely taking the test cases for the Parse instruction
> (the basic/named ones) and rewriting them for that class.  Let me know
> if that is something I should spend time on before I start up that
> learning curve :-)


That would be a great place to start!  I also eventually have plans to
add some editing functions, but only after I get the base functions
working.

>
> Now some questions about the design so far.  It appears that you have
> started with the Java regex design and are now extending it to include
> features NOT supported by Java.  Not knowing enough about this whole
> subject, can you tell me how your design compares with the design
> specified by the ECMAScript standard?  I.e. is it a (sub|super)set or
> are there mutually exclusive parts in the two versions?

The things I really based on the Java version were the use of the
pattern class and associated objects to do the matching.  As far as
the regex expressions go, I really just worked my way through
Mastering Regular Expressions and implemented everything I found there,
as long as the result remained consistent.  Where there were syntax
differences, I used the posix style if there was one.  In general,
this means that what I have is a superset of the Java version, but
I've not referred back to the ECMAScript standard yet, so I can't
really tell you how it compares.  Differences from the Java version
include:

- Named capture groups (stolen from the .Net version)
- \A boundary marker
- \v (vertical tab) escape character
- No unicode support, so there's no \unnnn escape sequence support,
for example.
- The \p{family} set is different.  The unicode versions are not
there, and I've added some Rexx-specific ones (RexxSymbol,
RexxVariableStart, RexxOperator, RexxSpecial).  It's also possible to
add additional named class families to a compiler instance.
- I don't support the x, c, and x options in an expression.
- When possible (or practical), I removed restrictions that the Java
version has chosen to implement.  For example, I allow non-bounded
match patterns to be used for lookbehinds.  I might be missing
something, but I couldn't figure out why that was so difficult to
allow.

I think those are the major differences.
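
A few of the features in the list above can be tried out in other engines. The following is only a rough sketch using Python's re module as a stand-in; it does not use the ooRexx classes, and the sample strings are made up. Note that Python spells named groups (?P<name>...) rather than the .Net (?<name>...) form, and that Python requires fixed-width lookbehinds, so an unbounded lookbehind is rejected when the expression is compiled.

```python
import re

# Named capture groups plus the \A anchor (start of the whole string).
# Python's spelling is (?P<name>...); the .Net form is (?<name>...).
m = re.search(r"\A(?P<key>\w+)=(?P<value>\w+)", "color=blue; size=large")
key, value = m.group("key"), m.group("value")

# \v matches a vertical tab character (0x0B).
has_vtab = re.search(r"\v", "a\x0bb") is not None

# Python only accepts fixed-width lookbehinds, so an unbounded
# lookbehind such as (?<=a+) fails at compile time.
try:
    re.compile(r"(?<=a+)b")
    unbounded_lookbehind_ok = True
except re.error:
    unbounded_lookbehind_ok = False
```

An engine that lifts the lookbehind restriction, as described above, would accept the last expression instead of raising an error.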

>
> Next I noted from the examples you posted that the groups collection
> seems to be 0-based - groups(0) - which is consistent with Java but not
> with Rexx.  Would you consider making it 1-based and how difficult would
> that be?

I considered this, but don't think it is a good idea, because this
would be one place where the syntax of a Rexx expression would differ
from a posix standard one.  Also, the 0th group is the main match.
The normal numbering of the capture groups within the expression
starts with 1.  So for the expression "(abc)\1", the \1 refers to the
expression inside the first capture group.  Changing the numbering to
be 1-based would require this to become "(abc)\2", which would be
very hard to follow.  Changing this would also make it
much more difficult to bring in regular expression examples from other
sources.
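
The numbering convention described above is the same one most regex engines use. As a minimal sketch, again with Python's re module standing in for the ooRexx classes (the sample string is made up), group 0 is the whole match and \1 refers back to the first capture group:

```python
import re

# "(abc)\1" matches "abc" followed by a repeat of whatever group 1 captured.
m = re.search(r"(abc)\1", "xxabcabcxx")

whole = m.group(0)   # the main match
first = m.group(1)   # text captured by the first parenthesized group
count = m.re.groups  # number of capture groups in the pattern
```

Here whole is "abcabc" and first is "abc". With a 1-based groups collection, the group index and the backreference number would no longer agree.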

Rick

>
> Think that's all for now.
>
> Rick McGuire wrote:
>> Quite a bit of progress since I sent this.  Next piece is conditional
>> patterns, which might take me a little longer to add because I somehow
>> missed them when I made my first pass through Mastering Regular
>> Expressions.  This is going to be all new code as a result, and the
>> Java version doesn't support this, so it will be a little more
>> difficult for me to compare results to make sure I'm doing things
>> correctly.
>>
>> Reusable patterns are something new I cooked up, but I'm very
>> amazed that nobody has ever done something like this.  In reading up
>> on regular expressions, I'm amazed that books like Regular Expressions
>> Cookbook exist where users have to type in or cut-and-paste regular
>> expressions that are multiple lines long.  Just examine the patterns
>> for matching URLs in the Regular Expressions Cookbook, for example.
>> Those are nuts!  There should be an easier way to use that sort of
>> expertise without having to deal with long series of strange looking
>> characters.  There's already a flavor of this with named class
>> patterns such as \p{Lower}.  I've extended this in the ooRexx
>> version by allowing additional named families to be added to a regex
>> compiler instance.
>>
>> So, my thinking here is to add a similar syntax to allow reusable
>> named patterns to be referenced in a regular expression.  There are
>> only a few letters available for escaped operations where both the
>> lowercase and uppercase letters are not used in various flavors.  I'm
>> tentatively considering using \m and \M (for the NOT version).
>> Another option would be to overload \p and \P by using a different
>> delimiter for the name.  <name> is used for named groups, so \p<url>
>> would be a compatible extension to what other regex dialects use.
>>
>> So, for example, to validate a line containing just a URL, you could
>> use the following regex expression:
>>
>> ^\m{url}$
>>
>> vs. something like this.
>>
>> ^(?#Protocol)(?:(?:ht|f)tp(?:s?)\:\/\/|~/|/)?(?#Username:Password)(?:\w+:\w+@)?(?#Subdomains)(?:(?:[-\w]+\.)+(?#TopLevel
>> Domains)(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2}))(?#Port)(?::[\d]{1,5})?(?#Directories)(?:(?:(?:/(?:[-\w~!$+|.,=]|%[a-f\d]{2})+)+|/)+|\?|#)?(?#Query)(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d{2}])+=(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?#Anchor)(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?$
>>
>> It would be really useful to be able to extract information from the
>> saved patterns using named groups to get pieces of information from a
>> match.  For example a URL pattern might allow you to extract the
>> protocol, domain, port, etc. pieces of the URL by name after a
>> successful match.  My current thinking is to make this information
>> available from the group that contains the pattern reference using the
>> same name as the compiled pattern.  So, to get the host information
>> from a match, you would use something like this:
>>
>> r = p~find(line)
>> if r~matched then do
>>    -- group(0) is the main matching group, get the match information from the
>>    -- url reference and extract its "host" group
>>    say "The target host is" r~group(0)~pattern("url")~group("host")~text
>> end
>>
>> I envision having the base compiler supporting a wide list of common
>> matching patterns, and it will also be possible to add additional
>> pattern types to the set of callable patterns.
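
The reusable-pattern idea quoted above can be approximated in engines that lack it by expanding references before compiling. The following is only a sketch of that approach in Python, not the ooRexx design: the names NAMED_PATTERNS and expand are hypothetical, the \m{name} syntax is the tentative one from the message, and the URL pattern is deliberately far simpler than a production one.

```python
import re

# Hypothetical registry of reusable patterns (illustrative only).
NAMED_PATTERNS = {
    "host": r"(?P<host>[-\w]+(?:\.[-\w]+)+)",
    "url": r"(?P<protocol>https?|ftp)://\m{host}(?::(?P<port>\d{1,5}))?(?:/\S*)?",
}

# Matches a reference of the form \m{name}.
_REFERENCE = re.compile(r"\\m\{(\w+)\}")

def expand(pattern, registry=NAMED_PATTERNS, depth=0):
    # Replace each \m{name} reference with the registered pattern,
    # wrapped in a non-capturing group.  References may nest.
    if depth > 10:
        raise ValueError("pattern references nest too deeply")

    def substitute(match):
        body = registry[match.group(1)]
        return "(?:" + expand(body, registry, depth + 1) + ")"

    return _REFERENCE.sub(substitute, pattern)

compiled = re.compile(expand(r"^\m{url}$"))
m = compiled.search("http://www.example.org:8080/docs/")
```

After a successful match, m.group("host") returns the host portion by name. Unlike the design described in the message, this textual expansion flattens all group names into one namespace, so referencing the same named pattern twice in one expression would fail with a duplicate group name; scoping the names under the pattern reference avoids that.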
>>
>> Anyway, once I have conditionals done and have implemented the
>> comment node (which the example above shows I'm also missing), I'll
>> start playing with this capability.
>>
>> Rick
>>
>> On Wed, May 5, 2010 at 12:20 PM, Rick McGuire <object.r...@gmail.com> wrote:
>>
>>>The regular expression incubator project is moving along at a fairly
>>>good pace.  Most of the basics are now implemented and have unit
>>>tests, so many of the standard expression types should be working now.
>>>Things I have yet to finish are:
>>>
>>>1)  Unit tests for "lookarounds"
>>
>>
>> Done
>>
>>
>>>2)  Non-capturing groups
>>
>>
>> Done
>>
>>
>>>3)  Atomic non-capturing groups
>>
>>
>> Done
>>
>>
>>>4)  The various option flags, both on the compiler instance and as
>>>flags in the expressions.
>>
>>
>> Done
>>
>>
>>>5)  Conditional patterns
>>>6)  Reusable patterns
>>>7)  Tests for the parser class
>>>8)  Tests for the split method
>>>9)  Support for \Q \E qualifiers during parsing.
>>
>>
>> Done
>>
>>>
>>>This is mostly a todo list to remind me of what I still need to do.
>>>I'd love it if people would start trying this out, or even better, start
>>>writing test cases for this.  It would be great to take something
>>>like the Regular Expressions Cookbook and test out the different
>>>expressions there in a test group.  Once I'm reasonably comfortable
>>>with the quality of this, I'll start considering doing something like
>>>building regex support into the parse instruction or using regex
>>>filters on the new File class in 4.1 (hint, hint).  However, if there
>>>really doesn't appear to be any interest in this, it will likely stay
>>>in the incubator.
>>>
>>>Rick
>>>
>>
>>
>> ------------------------------------------------------------------------------
>>
>> _______________________________________________
>> Oorexx-devel mailing list
>> Oorexx-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/oorexx-devel
>>
>
> --
> Gil Barmwater
