Since A5 is the next thing up, it's probably worthwhile to try to get any changes we want made to regexes in before they get finalized. :)

The (?something ..) syntax, which was used to extend the lifetime of parentheses in Perl5, was due to the lack of matched delimiters. After a little discussion between DrForr, brentdax, and me, we came up with using {} as another set of delimiters. One proposal makes their use unambiguous as long as you don't have a number as the first thing after the opening {: {\d would be the old use, and {\D would be the new use.

I didn't like how everything got squeezed into Perl5's (? .. ) syntax. The constructs were:

    (?#..)  (?imsx-imsx:..)  (?:..)  (?=..)  (?!..)  (?<=..)  (?<!..)
    (?{..})  (??{..})  (?>..)  (?(..)..|..)

From what I understand, (?( and (?{ can both be combined into a smarter use of (??{ :

    (?($a)$b|$c)  =>  (??{ $a ? $b : $c })   # using perl5 ?: operator here
    (?{$a})       =>  (??{ $a; qr// })
    (??{$a})      =>  (??{ $a })

I have a feeling Larry has already weighed in on that particular issue, but I don't remember what he said.

The various zero-width (positive|negative) look-(behind|ahead) assertions could be combined into some sort of orthogonal system. For example, assuming we have {} available, we can define the operator roughly like:

    {(\S+) (.*?)}

(assuming that .*? handles nestedness properly). The value of \S+ would be split into characters, and each character would implement some directive. So for example:

    =   try to match what follows -- positive assertion. The default.
    !   match the regex that follows and fail -- negative assertion.
    >   match forwards. The default.
    <   match backwards. This would be followed by a sexeger. Limiting it
        to be a lookbehind would necessarily restrict what follows to be a
        constant string, and not a regex. If you want to implement
    |   zero-width match. | is skinny, and doesn't take up any room.
    _   regular matching. The default. _ is wide and does take up room.

(This assumes you program perl in non-monospaced fonts ;)

And luckily, these don't destroy the non-termination quality of regexes, either.
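
As a rough illustration of the directive characters listed above, here is a minimal sketch in Python (used only as a stand-in, since the proposed syntax has no implementation). It splits the directive string into characters and lets each one flip a setting, starting from the defaults. The names Directives, parse_directives, and split_group are hypothetical, and letting a later character override an earlier conflicting one is my assumption, not part of the proposal.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Directives:
    """Settings for one {..} group; defaults correspond to '_=>'."""
    positive: bool = True    # '=' (default) vs '!'
    forwards: bool = True    # '>' (default) vs '<'
    consuming: bool = True   # '_' (default) vs '|'


# Each directive character sets exactly one field (hypothetical table).
_DIRECTIVE_TABLE = {
    "=": ("positive", True),
    "!": ("positive", False),
    ">": ("forwards", True),
    "<": ("forwards", False),
    "_": ("consuming", True),
    "|": ("consuming", False),
}


def parse_directives(chars: str) -> Directives:
    """Split the directive string into characters and apply each one.

    Assumption: if two characters conflict (e.g. '=!'), the last one wins.
    """
    settings = {}
    for ch in chars:
        if ch not in _DIRECTIVE_TABLE:
            raise ValueError(f"unknown directive character: {ch!r}")
        field, value = _DIRECTIVE_TABLE[ch]
        settings[field] = value
    return Directives(**settings)


def split_group(text: str):
    """Apply the rough {(\\S+) (.*?)} definition to one whole group.

    This does not handle nesting properly; it only separates the
    directive characters from the body of a single group.
    """
    m = re.fullmatch(r"\{(\S+) (.*?)\}", text, re.DOTALL)
    if m is None:
        raise ValueError(f"not a directive group: {text!r}")
    return parse_directives(m.group(1)), m.group(2)
```

With this sketch, split_group("{|! regexes}") yields a non-consuming, negative, forwards group with body "regexes", while parse_directives("_=>") is just the defaults.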

So by default, all regexes are surrounded by {_=> } to indicate they try to match what follows, consume text, and are heading forwards. Here's an example of a convoluted use of these:

    "regexes" =~ /({> regexes{< sexeger})regexes}{< sexeger} {|= sexeger}{!| regexes}/

succeeds, with $& eq ''. The order in which characters are matched is: regexes, sexeger, regexes, sexeger, (negative zero-width attempt on sexeger, which fails, so match continues), (positive zero-width attempt on regexes, which succeeds, so match succeeds). By then we are back at the start of the string.

$1 and $2 and so on would reflect the characters being matched. So $1 in the above example would be regexes. This destroys the COW optimizations we can do on $1, $`, $', and so on to avoid making large copies of strings everywhere. However, we only need to kill this optimization *if* the regex contains {< ..}. If it doesn't, we can just use indices into the string to keep the overhead low.

brentdax asked me for a use of {< blah} to change direction within a regex, but I was unable to come up with one. We both see the need for regexes that match backwards as a whole, but he argues that switching in the middle of a regex match doesn't have a purpose. Can anyone come up with a useful regexesexeger? :) I still think we should be able to change direction mid-regex, just because it's cooler, and fits in well with the system I've described. :)

Now, for the mapping of the (?..'s into the { syntax. I'm going to give the full-out form, although they can often be simplified if you know something about the context in which you are placing the item (which you often do). The third column represents what the thing would look like in the context of a regular unmodified regex.

    Perl5     Generic-CompletelySpecified     Perl6-inRegularRegex
    (?:       {_=>                            {
    (?=       {|=>                            {|
    (?!       {|!>                            {|!
    (?<=..)   {|<= \Q({reverse '..'})\E}      {|< \Q({reverse '..'})\E}
    (?<!..)   {|<! \Q({reverse '..'})\E}      {|<! \Q({reverse '..'})\E}

Pros:
- Non-capturing grouping is now much easier: {a|b|c}+
- User-extensible to support additional meta characters. Want to add something which does god-knows-what? Register a character along with compilation logic for the contents.

None of the above requires changes to the current Parrot regex engine, and these constructs were likely implemented in Perl5 only because doing so didn't require drastic changes to the nature of the engine there either. It's mostly just a compilation detail.

Cons:
- ! and | are way too similar. A better symbol for | is welcome.
- User has to remember exactly how to do lookaheads using the operators. Of course, they needed to remember what the syntax was before, so it's not like the new version will be much worse.
- Need to manually reverse $text, and it looks much worse. Perhaps this particular "forwards-reading text, but match before us, starting length '..' before" idiom can be Huffman-encoded somehow?

And for example, one can register 'e' as eval, and use the misx characters to turn on/off those symbols for the thing enclosed in braces. The problem with this latter idea is that the proposed system does not support turning these pragmas off. Perhaps we could make '-' a special character which acts like it did in (?misx-misx: ?

One could register 'e'=>eval and 'r'=>regex-interpolate to handle:

    {e print "hello"}

and

    {r $code = calculation($&);qr/$code/}

(Assuming you want to keep these separate, and don't want to force 'e' to become {r print "hello";qr//}, where qr// represents an empty regex that doesn't insert anything into the regex stream.)

I think I'll stop here, as I think I've said more than enough.

Mike Lambert