regex README

Steve Fink Sat, 20 Nov 2004 23:55:12 -0800

cvsuser     04/11/20 23:55:04


  Modified:    languages/regex README
  Log:
  Update the state, clarify a bunch of things.
  
  Revision  Changes    Path
  1.13      +138 -41   parrot/languages/regex/README
  
  Index: README
  ===================================================================
  RCS file: /cvs/public/parrot/languages/regex/README,v
  retrieving revision 1.12
  retrieving revision 1.13
  diff -u -r1.12 -r1.13
  --- README    18 Sep 2004 04:22:31 -0000      1.12
  +++ README    21 Nov 2004 07:55:04 -0000      1.13
  @@ -1,5 +1,8 @@
   languages/regex: Compile regular expressions into optimized bytecode.
   
  +NOTE: This Is Not The Matching Engine You're Looking For. See the
  +bottom of this readme, in the section "WHAT ABOUT P6GE?"
  +
   RUNNING
   =======
   
  @@ -23,6 +26,11 @@
     perl regex.pl --optimize=t '(a*|b)a'
     perl regex.pl --optimize=l '(a*|b)a'
   
  +To do the same thing, only generating perl5 backend code instead of
  +PIR code:
  +
  +  perl regex.pl --language=perl5 '(a*|b)a'
  +
   For development, it may also be useful to dump out the tree (remember,
   by default it will be the optimized tree):
   
  @@ -35,28 +43,57 @@
   All of this really ought to be in a usage message.
   
   To run any of the generated code, you now need the Match and
  -MatchRange PMCs. Run 'make' in parrot/dynclasses/ to generate these.
  -It will probably only work on Linux and similar architectures
  -currently; sorry.
  -
  -New stuff: now you can use the compiler as an embedded Parrot compiler.
  -Run 'make regex-compiler.pbc' to generate regex-compiler.pbc. Then run
  -../../parrot 01_basic.imc, ../../parrot 02_date.imc, etc. These will
  -dynamically load in the regex-compiler as the compiler for the "regex"
  -language, and then use it to compile some code and execute it.
  +MatchRange PMCs. They should be generated automatically now by the
  +root makefile, but if for some reason they aren't, run 'make' in
  +parrot/dynclasses/ to generate them.
  +
  +New stuff (CURRENTLY BROKEN): now you can use the compiler as an
  +embedded Parrot compiler. Run 'make regex-compiler.pbc' to generate
  +regex-compiler.pbc. Then run ../../parrot 01_basic.imc, ../../parrot
  +02_date.imc, etc. These will dynamically load in the regex-compiler as
  +the compiler for the "regex" language, and then use it to compile some
  +code and execute it.
  +
  +Also, I have hacked in multiple rule definitions into the silly
  +frontend parser. I don't intend to stick with this syntax for long,
  +since I want to get to a point soon where I can bootstrap this
  +engine's own parser, but for now you can define rules with
  +'&rulename=...'. So to compile a simple expression parser:
  +
  +  perl regex.pl --language=perl5 
'&expr=<?term>\*<?expr>|<?term>\/<?expr>|<?term>&term=<?atom>\+<?term>|<?atom>\-<?term>|<?atom>&atom=[0-9]+|\(<?expr>\)&default=<?expr>'
  +
  +Sorry, no closure actions yet. Coming soon. And yes, that is a mess,
  +but if you format it nicely it should make more sense:
  +
  +  &expr = <?term> \* <?expr>
  +        | <?term> \/ <?expr>
  +        | <?term>
  +
  +  &term = <?atom> \+ <?term>
  +        | <?atom> \- <?term>
  +        | <?atom>
  +
  +  &atom = [0-9]+
  +        | \( <?expr> \)
  +
  +  &default = <?expr>
  +
  +All of the <?foo> things are there so that the rules get nicely
  +captured -- by default, nested rule calls are *not* capturing.
   
   STATUS
   ======
   
   Everything should be more or less working, though many operators are
  -untested. Theoretically, this release should support:
  +untested. Theoretically, this release should support (using Perl5
  +syntax here):
   
    RS      - sequences
    R|S     - alternation
    R*      - greedy Kleene closure
  - R*?     - nongreedy/parsimonious Kleene closure
  + R*?     - nongreedy (parsimonious) Kleene closure
    R?      - greedy optional
  - R??     - nongreedy/parsimonious optional
  + R??     - nongreedy (parsimonious) optional
    R+      - greedy one or more ("more or one")
    R+?     - nongreedy one or more
    (R)     - capturing groups
  @@ -84,6 +121,7 @@
    (?(cond)R|S)
            - conditional expression
    (R?)*   - empty match suppression
  + <rule>  - rule invocation (Perl6 feature)
   
   Missing perl6 features:
   :
  @@ -97,25 +135,21 @@
   %var     - hash key alternatives
   <%var>   - as above, but values are always regexps
   <!...>   - negation (what is this?) (is this just negative lookahead?)
  -<rule>   - rule invocation
   ...      - and many, many, many more
   
   Other features I'd like to have:
  - (?1)    - recursive expressions
    @A =~   - array matching
    ?       - reentrant, suspendable, coroutinish regexes
    ?       - two-dimensional regexes
   
  -Regular expressions are compiled down to regular opcodes (not the
  -rx_* set of opcodes). P0 and P1 are PerlArrays containing the starting
  -and ending indexes, respectively, of () groups. The user stack is used
  -as the backtracking stack. See rx.ops for a good description of how
  -operators are converted to code sequences (except I don't use the rx_*
  -ops that it describes). Marks are the value -1; indices are
  -nonnegative integers. (Except in debugging mode, when marks are
  -instead strings describing what they're marking.) For details on
  -exactly how things are translated with this compiler, read Rewrite.pm
  -and Rewrite/Stackless.pm.
  +Regular expressions are compiled down to regular opcodes (not the rx_*
  +set of opcodes). A *Array PMC of some sort is used as the backtracking
  +stack. See rx.ops for a good description of how operators are
  +converted to code sequences (except I don't use the rx_* ops that it
  +describes). Marks are the value -1; indices are nonnegative integers.
  +(Except in debugging mode, when marks are instead strings describing
  +what they're marking.) For details on exactly how things are
  +translated with this compiler, read Rewrite.pm.
   
   "Local" variables:
    * One integer temporary, not preserved between tree op rewrites
  @@ -134,18 +168,9 @@
   
   Future plans:
   
  -test.pl generates stack cleanup code, but regex.pl does not. I'm still
  -not sure exactly where that code ought to go, because especially when
  -I really start on Perl6 regular expressions, the ability to resume a
  -match will be critical.
  -
   Relatively soon, I would like to add array-based regular
   expressions. A simple cut of this should be nearly trivial.
   
  -I'd really like to put in recursive regular expressions. See
  -<http://www.puffinry.freeserve.co.uk/regex-extension.html>.
  -UPDATE: Looks like Larry liked the idea too. Perl6 has 'em.
  -
   Near-term optimizations planned:
   
    Simple subexpression alternation: the code for alternations can be
  @@ -164,19 +189,27 @@
       f    -> $start_S
       else -> backtrack
   
  - Multi-character literals: currently, "abc" expands to "match a then
  - match b then match c". I don't plan to do a substring match anytime
  - soon, but I would like to eliminate two of the three end-of-input tests.
  + Length check suppression: if you try to match /abc/, the unoptimized
  + engine will check to be sure there's at least one character left,
  + then match 'a', then check to be sure there's still at least one
  + character left, then match 'b', etc. The tree optimization pass will
  + transform this into: check if there are at least 3 characters left,
  + then match 'a', then match 'b', then match 'c'. Obviously, this
  + particular case might be further optimized by doing some kind of
  + fixed substring match (or a Boyer-Moore "skip a number of characters
  + determined by the input and the fixed substring"), but this
  + optimization works for more general cases (such as
  + C<< a.[0-9](x|\w) >>).
   
   Longer-term optimization vague ideas:
   
  - Find maximal subsequences of regex ops that can be converted to
  - DFAs. Translate them into in-line DFAs. The jump tables above are a
  + Find maximal subtrees of regex ops that can be converted to DFAs.
  + Translate them into in-line DFAs. The jump tables above are a
    primitive form of this. The hard part is figuring out whether a DFA
    would produce exactly the same results as an NFA for a given
  - expression. (It's a cost/benefit game. Some expressions
  - trivially behave the same, some trivially behave differently, and some
  - are difficult to determine.)
  + expression. (It's a cost/benefit game. Some expressions trivially
  + behave the same, some trivially behave differently, and some are
  + difficult to determine.)
   
   BUGS
   ====
  @@ -197,3 +230,67 @@
   http://0xdeadbeef.net/wiki/wiki.pl?FinkBlog
   
   Original author: Steve Fink <[EMAIL PROTECTED]>
  +
  +WHAT ABOUT P6GE?
  +================
  +
  +This rule compiler is not the officially blessed one. That would be
  +Patrick Michaud's p6ge, which you can find in compilers/p6ge. This
  +engine predates p6ge by a year or two, but never managed to generate
  +sufficient interest to get anyone else involved. Patrick must have
  +been aware of this engine, since I told him about it both personally
  +and in a message to perl6-internals, but he has never acknowledged its
  +existence nor explained why he felt the need to start from scratch. I
  +have to conclude that he either looked at it and didn't like the
  +design or the implementation; or he just wanted to start from scratch
  +so that he could fully understand the system he was working on. All of
  +which are perfectly good reasons, so I bear no ill will towards the
  +official effort.
  +
  +I am assuming that p6ge is going to get the momentum of the community
  +behind it, so I would advise anyone interested in working on a rule
  +engine to look there first. (Look for discussion on the
  +perl6-internals and perl6-compiler mailing lists.) However, I still
  +intend to work on this engine for a while longer, and welcome any
  +interested participants. (Send any requests/comments/suggestions
  +either to perl6-internals or directly to me at [EMAIL PROTECTED]) So
  +far, I have only briefly looked at p6ge, but I think this
  +languages/regex engine has enough of a different approach that it is
  +still valuable for gathering lessons -- and may still make the most
  +sense in the long run.
  +
  +That last statement demands a bit of explanation, so here's an excerpt
  +of a mail I sent out after my first look at p6ge (remember, p6ge has
  +probably advanced past this point by now):
  +
  +It sounds like languages/regex handles pretty much exactly the same
  +things as p6ge, probably a few more and a few less. I haven't actually
  +looked at the code, but from the description I'd guess that the main
  +differences are:
  + 
  +  - p6ge is implemented in C; l/rx in Perl5
  +  - p6ge generates PIR; l/rx has both PIR and Perl5 backends
  +  - p6ge uses coroutines and continuations; I have always been too
  +    wary of their stability, so I use plain subs (with a 'mode'
  +    parameter to tell it whether to try to match or backtrack)
  +  - Both allow you to "continue" a match to find all other possible
  +    matches, but that's really because it falls out of any sane
  +    implementation (you have to keep all that state around somehow
  +    anyway)
  +  - l/rx uses match objects (dynclasses/match.pmc) and automatically
  +    generates a parse tree out of them
  +  - p6ge has built-in "dump out the matching info" routines; I make my
  +    test harnesses generate their own. I'm jealous.
  +  - The feature sets are nearly identical.Makes sense, I suppose --
  +    low-hanging fruit and all that.
  +  - It sounds like the internal design is rather different. I try hard
  +    to compile down to very minimalistic PIR ops. It sounds like p6ge
  +    uses lots of higher-level operations, to do things like processing
  +    a whole chunk at a time. (Although on the other hand, p6ge uses
  +    more native Parrot flow control mechanisms than I do.)
  +  - Closely related to the above, I have a number of optimizations
  +    already implemented, but I suspect p6ge will end up with a very
  +    different set of optimizations.
  +  - I have on average about 5 hours a week to work on l/rx; Patrick
  +    has quite a bit more :-) (Which does NOT mean that I work faster;
  +    my engine is at least a year older than p6ge.)

cvs commit: parrot/languages/regex README

Reply via email to