Re: A serious stab at regexes

Angel Faus Sat, 10 Nov 2001 18:11:48 -0800

Hi Brent,

>
> It just means you have to be more explicit.  I consider that a Good
> Thing--Perl 5's regular expressions are compact enough to be represented
> like: but the internals to support them are an absolute jungle.  I'd rather
> have a few exposed ops than have a chunk of code like Perl 5's regular
> expressions.  Besides, being able to /see/ where we jump to on each op
> when we fail will help when debugging the RE compiler and such.
>


I totally agree. It is not about having fewer opcodes or a more compact 
syntax, but a more maintainable system, I just pretend that having a stack 
were you predeclare the possible continuation paths is simpler than your 
marks idea. 

I could be biased here of course. I would love some comments (specially of 
someone with experience in this area) about whether were are going the right 
way here. 

> How do you plan to support lookahead?  Unless I'm mistaken, with my
> proposal it goes something like:
>
>       rePushindex
>       rePushmark
>       #lookahead code in here
>       rePopmark
>       rePopindex
>

This could be solved by having a new pair of ops (reGetIndex /reSetIndex) 
that save and restore the current index into an INTVAL register:

        reGetIndex I1 
        #lookahead code in here
        reSetIndex I1

So lookahead would be an special case, but that's fine, because it _is_ an 
special case.

> There are advantages and disadvantages to both proposals.  The question
> is, which one (or both) is flexible enough to do what we want it to do?
>

Probably both proposals are good enough, so I would say that we should 
choose one (not necessarily mine's) and go ahead. I would love to see some 
regex support commited soon (maybe as an oplib) so we can hack the languages 
(babyperl and others) and gain some experiencie about performance and 
optimitzations...

¿Is there any plan about when to commit regular expression support on Parrot? 

Dam Sugalsky has said that he is not interessed at all in the design of regex 
opcodes.  Who is going to have a final say on this? 

Just an idea, but ¿could we have someone with experience in perl5 regex 
internals playing the role of "regex pumpking"?

>
> I look forward to seeing it.
>

I am attaching a patch that implements the modifications I suggest to your 
code. The tests and the compiler are updated too and work fine.

(btw, I am not sure about if I choosed the right options on the diff command 
¿could someone please tell me on private what's the right way of submiting 
patches when there are various files?)

There is another (totally independent) question that bothers me: 

¿should we use the regex flags on compile time to perform agressive 
optimitzations, or should we rely on runtime checks to get the job done?

I'd better explain myself with an example:

We have (in these patches) an opcode called reAnyhing. When called, it testes 
the RE_single_line flag, and acts in consequence.

A small optimitzation could be gained if we had two versions of the op (like 
reAnythingML [multi-line version] and reAnythingsSL [single-line version]) 
and choosed between them on compile-time.

This can be applied to most flags, and would hopefully result in a speed 
increase (less runtime checks), at the cost of using more opcodes numbers.

Another example: case-insensitive matches can be implemented as run-time 
checks on the corresponding flag (that's the way most regex engines do it), 
or by normalizing the regular expression to lower case on compile time, and 
then converting the string to lower case just at the begining of the match.

(The former is actually propsed by Jeffrey Friedl in the Mastering Regexs 
book claiming that it would be much faster)

How agressive are we going to be with this kind of optimitzations? Or are we 
going to rely in a very smart JIT compiler?

---------------
Angel Faus
[EMAIL PROTECTED]

diff -crN parrot_current/parrot/Parrot/OpsFile.pm 
parrot_patched/parrot/Parrot/OpsFile.pm
*** parrot_current/parrot/Parrot/OpsFile.pm     Fri Oct 26 03:00:01 2001
--- parrot_patched/parrot/Parrot/OpsFile.pm     Sat Nov 10 16:08:44 2001
***************
*** 22,27 ****
--- 22,42 ----
  #my %opcode;
  #my $opcode;
  
+ my $backtrack_macro = <<EOM;
+ {
+ 
+   opcode_t *dest;
+   if(stack_depth(interpreter, cur_re->stack_base)){
+     pop_generic_entry(interpreter, &cur_re->stack_top, &cur_re->index, 
+STACK_ENTRY_INT);
+     pop_generic_entry(interpreter, &cur_re->dest_stack_top, &dest, 
+STACK_ENTRY_DESTINATION);
+     return dest;
+   }
+   else {
+     return cur_re->onfaildest;
+   }
+ }
+ EOM
+ 
  
  #
  # trim()
***************
*** 217,235 ****
        #
  
        $body =~ s/HALT/{{=0}}/mg;
!       
        $body =~ s/RESTART\(\*\)/{{=0,+=$op_size}}/mg;
        $body =~ s/RESTART\((.*)\)/{{=0,+=$1}}/mg;
!       
        $body =~ s/RETREL\(\*\)/{{+=$op_size}}/mg;
        $body =~ s/RETREL\((.*)\)/{{+=$1}}/mg;
!       
        $body =~ s/RETABS\((.*)\)/{{=$1}}/mg;
!       
        $body =~ s/\$(\d+)/{{\@$1}}/mg;
!       
        $op->body($body);
!       
        $self->push_op($op);
        $counter++;
    }
--- 232,252 ----
        #
  
        $body =~ s/HALT/{{=0}}/mg;
! 
        $body =~ s/RESTART\(\*\)/{{=0,+=$op_size}}/mg;
        $body =~ s/RESTART\((.*)\)/{{=0,+=$1}}/mg;
! 
        $body =~ s/RETREL\(\*\)/{{+=$op_size}}/mg;
        $body =~ s/RETREL\((.*)\)/{{+=$1}}/mg;
! 
        $body =~ s/RETABS\((.*)\)/{{=$1}}/mg;
! 
        $body =~ s/\$(\d+)/{{\@$1}}/mg;
! 
!       $body =~ s/BACKTRACK\(\)/$backtrack_macro/mg;
! 
        $op->body($body);
! 
        $self->push_op($op);
        $counter++;
    }
diff -crN parrot_current/parrot/core.ops parrot_patched/parrot/core.ops
*** parrot_current/parrot/core.ops      Tue Nov  6 15:00:01 2001
--- parrot_patched/parrot/core.ops      Sat Nov 10 16:08:25 2001
***************
*** 2,7 ****
--- 2,9 ----
  ** core.ops
  */
  
+ #include <parrot/re.h>
+ 
  =head1 NAME
  
  core.ops
***************
*** 1930,1935 ****
--- 1932,2317 ----
  
  ###############################################################################
  
+ =head2 Regular expression operations
+ 
+ These operations are used by the regular expression engine.  Unless 
+ otherwise noted, any regexp opcode which takes an integer constant as its 
+ last argument branches to that address if the op fails to match.
+ 
+ =over 4
+ 
+ =cut
+ 
+ ########################################
+ 
+ =item B<reMatch>(ic, s)
+ 
+ =item B<reMatch>(ic, sc)
+ 
+ Sets the string in $2 as the string to match against and branches to the 
+ regular expression at the address specified as $1.
+ 
+ =item B<reMatch>(i, s)
+ 
+ =item B<reMatch>(i, sc)
+ 
+ Same as the ic variant, but jumps to $1 instead of branching.
+ 
+ =cut
+ 
+ AUTO_OP reMatch(ic, s|sc) {
+   cur_re=mem_sys_allocate(sizeof(re_info));
+   
+   cur_re->string=$2;
+   cur_re->flags=0;
+   cur_re->index=0;
+   cur_re->minlength=0;
+   
+   /*
+   ** Allocate both stacks
+   ** XXX There ought to be a function to do this, like
+   **  stack_make(interpreter, &cur_re->stack_base, &cur_re->stack_top) or something 
+   */
+ 
+   cur_re->stack_base = mem_allocate_aligned(sizeof(struct StackChunk));
+   cur_re->stack_top = &cur_re->stack_base->entry[0];
+   cur_re->stack_base->used = 0;
+   cur_re->stack_base->free = STACK_CHUNK_DEPTH;
+   cur_re->stack_base->next = NULL;
+   cur_re->stack_base->prev = NULL;
+   
+   cur_re->dest_stack_base = mem_allocate_aligned(sizeof(struct StackChunk));
+   cur_re->dest_stack_top = &cur_re->stack_base->entry[0];
+   cur_re->dest_stack_base->used = 0;
+   cur_re->dest_stack_base->free = STACK_CHUNK_DEPTH;
+   cur_re->dest_stack_base->next = NULL;
+   cur_re->dest_stack_base->prev = NULL;
+     
+   /* push the current location onto the call stack--we're doing the equivalent of a 
+sub call */
+   push_generic_entry(interpreter, &interpreter->control_stack_top, cur_opcode + 3,  
+STACK_ENTRY_DESTINATION, NULL);
+     
+   RETREL($1);
+ }
+ 
+ AUTO_OP reMatch(i, s|sc) {
+   cur_re=mem_sys_allocate(sizeof(re_info));
+   
+   cur_re->string=$2;
+   cur_re->flags=0;
+   cur_re->index=0;
+   cur_re->minlength=0;
+ 
+   /*
+   ** Allocate both stacks
+   ** XXX There ought to be a function to do this, like
+   **  stack_make(interpreter, &cur_re->stack_base, &cur_re->stack_top) or something 
+   */
+ 
+   cur_re->stack_base = mem_allocate_aligned(sizeof(struct StackChunk));
+   cur_re->stack_top = &cur_re->stack_base->entry[0];
+   cur_re->stack_base->used = 0;
+   cur_re->stack_base->free = STACK_CHUNK_DEPTH;
+   cur_re->stack_base->next = NULL;
+   cur_re->stack_base->prev = NULL;
+   
+   cur_re->dest_stack_base = mem_allocate_aligned(sizeof(struct StackChunk));
+   cur_re->dest_stack_top = &cur_re->stack_base->entry[0];
+   cur_re->dest_stack_base->used = 0;
+   cur_re->dest_stack_base->free = STACK_CHUNK_DEPTH;
+   cur_re->dest_stack_base->next = NULL;
+   cur_re->dest_stack_base->prev = NULL;
+     
+   /* push the current location onto the call stack--we're doing the equivalent of a 
+sub call */
+   push_generic_entry(interpreter, &interpreter->control_stack_top, cur_opcode + 3,  
+STACK_ENTRY_DESTINATION, NULL);
+     
+   /* jump to the first argument */
+   RETABS((opcode_t *)$1);
+ }
+ 
+ ########################################
+ 
+ =item B<reFlags>(s)
+ 
+ =item B<reFlags>(sc)
+ 
+ Sets the regular expression's flags.  'i' sets the flag
+ RE_case_insensitive_FLAG, 's' sets the flag
+ RE_single_line_FLAG, and 'm' sets the flag
+ RE_multiline_FLAG.  Currently only 's' is implemented.
+ 
+ =cut
+ 
+ AUTO_OP reFlags(s|sc) {
+   int i;
+   char ch;
+   
+   for(i=0; i < string_length($1); i++) {
+     /*
+     ** XXX this is a REALLY naughty thing to do--I 
+     **  shouldn't poke around inside the string like this 
+     */
+     ch=((char *)$1->bufstart)[i];
+     
+     switch(ch) {
+       case 'i':
+         fprintf(stderr, "Warning: RE option /m not yet implemented");
+         RE_case_insensitive_SET(cur_re);
+         break;
+ 
+       case 's':
+         RE_single_line_SET(cur_re);
+         break;
+ 
+       case 'm':
+         fprintf(stderr, "Warning: RE option /m not yet implemented");
+         RE_multiline_SET(cur_re);
+         break;
+ 
+       default:
+         fprintf(stderr, "Warning: unrecognized RE option /%c", ch);
+     }
+   }
+ }
+ 
+ ########################################
+ 
+ =item B<reMinlength>(i)
+ 
+ =item B<reMinlength>(ic)
+ 
+ Sets the minimum number of characters that must be left in the
+ string for a match to be possible.  For example, the expression
+ /fo*bar/ must have at least 4 characters in the string left to
+ match; the expression /fo+bar/ requires five characters.  This
+ information is used to optimize calls to B<reAdvance>.
+ 
+ =cut
+ 
+ AUTO_OP reMinlength(i|ic) {
+   cur_re->minlength=$1;
+ }
+ 
+ 
+ ########################################
+ 
+ =item B<reOnFail)
+ 
+ =item B<reOnFail>(ic)
+ 
+ Sets the ultimate failing destination, the one that should be called
+ when a submatch fails and there is nothing in the backtracking stacks.
+ 
+ =cut
+ 
+ AUTO_OP reOnFail(i|ic) {
+   cur_re->onfaildest=cur_opcode + $1;
+ }
+ 
+ 
+ ########################################
+ 
+ =item B<reLiteral>(s, ic)
+ 
+ =item B<reLiteral>(sc, ic)
+ 
+ Matches the string in $1 literally; in other words, if $1="bar",
+ this op will match the exact string "bar" and nothing else.
+ (This should be sensitive to RE_case_insensitive_FLAG but isn't
+ currently.)
+ =cut
+ 
+ AUTO_OP reLiteral(s|sc) {
+   STRING * arg=$1;
+   STRING * cmp=string_make(interpreter, "", 0, 0, 0, 0);
+   
+   if(cur_re->index >= string_length(cur_re->string)) {
+     BACKTRACK();
+   }
+ 
+   string_substr(
+     interpreter,
+     cur_re->string,
+     cur_re->index,
+     string_length(arg),
+     &cmp
+     );
+ 
+   if(!string_compare(interpreter, arg, cmp)) {
+     cur_re->index += string_length(arg);
+   }
+   else {
+     BACKTRACK();
+   }
+ }
+ 
+ ########################################
+ 
+ =item B<reOneof>(s)
+ 
+ =item B<reOneof>(sc)
+ 
+ Matches if the next character in the string being matched against
+ is in $1.  (This should be sensitive to RE_case_insensitive_FLAG
+ but isn't currently.)
+ 
+ =cut
+ 
+ AUTO_OP reOneof(s|sc) {
+   int i;
+   STRING * arg=$1;
+   STRING * matchagainst=string_make(interpreter, "", 0, 0, 0, 0);
+   STRING * nextchar=string_make(interpreter, "", 0, 0, 0, 0);
+ 
+   if(cur_re->index >= string_length(cur_re->string)) {
+     BACKTRACK();
+   }
+ 
+   string_substr(
+     interpreter,
+     cur_re->string,
+     cur_re->index,
+     1,
+     &matchagainst
+   );
+ 
+   for(i=0; i < string_length(arg); i++) {
+     string_substr(
+       interpreter,
+         arg,
+         i,
+         1,
+         &nextchar
+       );
+ 
+     if(!string_compare(interpreter, matchagainst, nextchar)) {
+       cur_re->index++;
+       RETREL(*);
+     }
+   }
+ 
+   BACKTRACK();
+ }
+ 
+ ########################################
+ 
+ =item B<reAnything>()
+ 
+ This behaves the same as '.' in a regular expression; if
+ RE_single_line_FLAG is set, it matches any character, otherwise
+ it matches only newline.
+ 
+ =cut
+ 
+ AUTO_OP reAnything() {
+   STRING * newline=string_make(interpreter, "\n", strlen("\n"), 0, 0, 0);
+   STRING * cmp=string_make(interpreter, "", 0, 0, 0, 0);
+ 
+   if(cur_re->index >= string_length(cur_re->string)) {
+     BACKTRACK();
+   }
+ 
+   string_substr(
+     interpreter,
+     cur_re->string,
+     cur_re->index,
+     1,
+     &cmp
+   );
+ 
+   if(RE_single_line_TEST(cur_re) || string_compare(interpreter, newline, cmp)) {
+     cur_re->index++;
+   }
+   else {
+     BACKTRACK();
+   }
+ }
+ 
+ ########################################
+ 
+ =item B<reAdvance>()
+ 
+ This op skips forward one character in the string; it is used to
+ walk forward through the string.  For example, in the expression
+ C<"afoobarz" =~ /fo*bar/>, the 'f' doesn't immediately match the
+ 'a', so it skips forward a letter; this op provides that behavior.
+ 
+ =cut
+ 
+ AUTO_OP reAdvance() {
+   cur_re->index++;
+ 
+   if(cur_re->index+cur_re->minlength >= string_length(cur_re->string)) {
+     BACKTRACK();
+   }
+ }
+ 
+ ########################################
+ 
+ =item B<rePushindex>(i)
+ 
+ =item B<rePushindex>(ic)
+ 
+ This op pushes the current index onto the regular expression's stack.  It is
+ used to remember indexes to backtrack to later.
+ 
+ It also now pushes the $1 onto the regex destination stack. So, when there is a
+ submatch fail in the regex, it will backtrack to the right string index and
+ to the rigth regex node.
+ 
+ =cut
+ 
+ AUTO_OP rePushindex(ic|i) {
+   push_generic_entry(
+     interpreter,
+     &cur_re->stack_top,
+     &cur_re->index,
+     STACK_ENTRY_INT,
+     NULL
+   );
+ 
+   push_generic_entry(interpreter, &cur_re->dest_stack_top, cur_opcode + $1,
+     STACK_ENTRY_DESTINATION, NULL);
+ 
+ }
+ 
+ ########################################
+ 
+ =item B<reGetIndex>(i)
+ 
+ Sets $1 to current value of the index. Used for lookaheads.
+ 
+ 
+ ########################################
+ 
+ =item B<reSetIndex>(i)
+ 
+ Replaces the last item of the regex index stack with $1. Do not confuse with 
+rePushIndex.
+ Used for lookaheads.
+ 
+ ########################################
+ 
+ =item B<reFinished>()
+ 
+ This op does some cleanup so various data structures used by the regular
+ expression engine will be garbage collected and jumps back to the next op
+ after the reMatch that entered the regular expression.
+ 
+ =cut
+ 
+ 
+ AUTO_OP reFinished() {
+   opcode_t *dest;
+ 
+   (void *)cur_re->stack_top=(void *)cur_re->stack_base=NULL;  /* so it'll be GCed */
+   (void *)cur_re->dest_stack_top=(void *)cur_re->dest_stack_base=NULL;
+   cur_re=NULL; /* ditto */
+ 
+   pop_generic_entry(interpreter, &interpreter->control_stack_top, &dest, 
+STACK_ENTRY_DESTINATION);
+   RETABS(dest);
+ }
+ 
+ ###############################################################################
+ 
  =head1 COPYRIGHT
  
  Copyright (C) 2001 Yet Another Society. All rights reserved.
***************
*** 1940,1943 ****
--- 2322,2326 ----
  as the Parrot interpreter itself.
  
  =cut
+ 
  
diff -crN parrot_current/parrot/include/parrot/re.h 
parrot_patched/parrot/include/parrot/re.h
*** parrot_current/parrot/include/parrot/re.h   Wed Dec 31 19:00:00 1969
--- parrot_patched/parrot/include/parrot/re.h   Sat Nov 10 18:26:17 2001
***************
*** 0 ****
--- 1,101 ----
+ #if !defined(PARROT_RE_H_GUARD)
+ #define PARROT_RE_H_GUARD
+ 
+ #include <parrot/parrot.h>
+ #include <parrot/string.h>
+ #include <parrot/stacks.h>
+ 
+ /*
+ ** I'm stashing notes on the regex implementation here.
+ ** In Perl 5, the RE engine is built on a simple principle.  A printout of the
+ ** structure of the RE /fo*bar/ (obtained with -mre=debug) is below:
+ **
+ **   1: EXACT <f>(3)
+ **   3: STAR(6)
+ **   4:   EXACT <o>(0)
+ **   6: EXACT <bar>(8)
+ **   8: END(0)
+ **
+ ** The basic principle is that, if the operation matches, we jump to the number
+ ** in the parenthesis.  In this RE system, the exact opposite is true; we jump
+ ** when we _fail_.
+ **
+ **
+ **   RE:
+ **     reFlags ""
+ **     reMinlength 4
+ **  
+ **     branch $start
+ **
+ **   $advance:
+ **     rePopindex
+ **     reAdvance $fail
+ **   $start:
+ **     rePushindex
+ **     reLiteral "f", $advance
+ **   $findo:
+ **     literal "o", $findbar
+ **     rePushindex
+ **     branch $findo
+ **   $findbar:
+ **     reLiteral "bar", $backtrack
+ **     set I0, 1  #true
+ **     reFinished
+ **   $backtrack:
+ **     rePopindex $advance
+ **     branch $findbar
+ **   $fail:
+ **     set I0, 0  #false
+ **     reFinished
+ **
+ ** Although this looks like more code than the compact setup above, we're using
+ ** normal opcodes, so we have to be explicit.  There's no STAR op; the branch
+ ** implicitly does that, and the rePushindex helps it out.  Nothing is 'nested'
+ ** within something else--backtracking has to be explicit, and the pushing/popping
+ ** must also be explicit.  Even the behavior of starting at the next spot in the
+ ** string has to be explicitly laid out.  THESE ARE ALL SIDE EFFECTS OF THE FACT
+ ** WE'RE USING NORMAL OPCODES.  Further, these opcodes have very little communication
+ ** between them--a small structure (defined later on in this file) contains just the
+ ** string we're matching against, the index we're at right now, some metadata about
+ ** the RE, and a stack.  The size of the program isn't because we're jumping on
+ ** false instead of true.  The only place that isn't a win is alternation.
+ **
+ ** Okay, I'm done rambling.  Back to the code...
+ */
+ 
+ typedef struct re_info {
+   STRING * string;
+   INTVAL index;
+   INTVAL flags;
+   INTVAL minlength;
+ 
+   /* index stack stuff */
+   struct StackChunk * stack_base;
+   struct Stack_Entry * stack_top;
+ 
+   /* destination stack stuff*/
+   struct StackChunk * dest_stack_base;
+   struct Stack_Entry * dest_stack_top;
+ 
+   /* ultimate failing destination*/
+   opcode_t *onfaildest;
+ } re_info;
+ 
+ re_info * cur_re;
+ 
+ #define RE_case_insensitive_FLAG          0x1
+ #define RE_case_insensitive_TEST(info)    info->flags &   RE_case_insensitive_FLAG
+ #define RE_case_insensitive_SET(info)     info->flags |=  RE_case_insensitive_FLAG
+ #define RE_case_insensitive_CLEAR(info)   info->flags &= ~RE_case_insensitive_FLAG
+ 
+ #define RE_single_line_FLAG               0x2
+ #define RE_single_line_TEST(info)         info->flags &   RE_single_line_FLAG
+ #define RE_single_line_SET(info)          info->flags |=  RE_single_line_FLAG
+ #define RE_single_line_CLEAR(info)        info->flags &= ~RE_single_line_FLAG
+ 
+ #define RE_multiline_FLAG                 0x4
+ #define RE_multiline_TEST(info)           info->flags &   RE_multiline_FLAG
+ #define RE_multiline_SET(info)            info->flags |=  RE_multiline_FLAG
+ #define RE_multiline_CLEAR(info)          info->flags &= ~RE_multiline_FLAG
+ 
+ #endif
diff -crN parrot_current/parrot/languages/regex/BabyRegex.pm 
parrot_patched/parrot/languages/regex/BabyRegex.pm
*** parrot_current/parrot/languages/regex/BabyRegex.pm  Wed Dec 31 19:00:00 1969
--- parrot_patched/parrot/languages/regex/BabyRegex.pm  Sat Nov 10 16:10:29 2001
***************
*** 0 ****
--- 1,349 ----
+ 
+ 
+ package BabyRegex;
+ 
+ use YAPE::Regex 'BabyRegex';
+ use strict;
+ use vars '$VERSION';
+ 
+ $VERSION = '0.01';
+ 
+ my %modes = ( on => '', off => '' );
+ 
+ sub buildtree {
+   my $self = shift;
+   
+   my $cnt = 0;
+   my $groupscnt;
+   my @groups;
+   my @tree;
+   
+   while (my $node = $self->next) {            
+     
+     $node->id($cnt++);
+     $tree[-1]->next($node) if @tree;  
+     
+     if ($node->type =~ /capture|group/) {
+       push @groups, $node;
+       $node->{ALTS} = [];
+       $node->{COUNT} = $groupscnt++;
+       }       
+       
+     if ($node->type eq "alt")          {
+       push (@{$groups[-1]->{ALTS}}, $node);
+       $node->{GROUP} = $groups[-1];
+       }
+     
+     if ($node->type eq "close"){
+       my $groupnode = pop @groups;
+       $groupnode->{CLOSED} = $node;
+       $node->{GROUP} = $groupnode;
+       }
+     push (@tree, $node);      
+     }
+     
+   return @tree;  
+    
+ }
+ 
+ sub cry {
+   if (@_[1]) {
+       my $label = shift;
+       my $opcode = shift;
+   
+       print $label. ":\n";
+       print "     $opcode\n";
+   }
+   else {
+       my $opcode = shift;
+       print "     $opcode\n";
+   }
+ 
+ }
+ 
+ 
+ sub pasm {
+   my $self = shift;  
+   my @tree = $self->buildtree;   
+   
+   cry 'RE', 'reOnFail $fail';
+    
+   cry '$init', 'pushIndex $advance';
+   
+   for my $node (@tree) {
+     $node->pasm($self);               
+   }
+   
+   cry '$ok', "set I1, 1";
+   cry "reFinished";
+   
+   cry '$fail', "set I1, 0";
+   cry "reFinished";
+   
+   cry '$advance', "reAdvance";
+   cry 'branch $init';
+   
+   }
+ 
+ 
+ 
+ ##
+ ## shared methods
+ ##
+ 
+ sub BabyRegex::Element::id {
+    my $self = shift;
+    my $id = shift;
+    
+    if ($self->{ID}) { return $self->{ID} }
+    else {
+        $self->{ID} = '$L' . $id;
+    }
+ 
+ }
+ 
+ 
+ sub BabyRegex::Element::next {
+    my $self = shift;
+    my $next = shift;
+    
+    if ($next) {
+       $self->{NEXT} = $next;
+       return $next;
+    }
+    else {
+       return $self->{NEXT}
+    }
+    
+ }
+ 
+ 
+ 
+ sub BabyRegex::Element::cry_atomic {
+   my $self = shift;
+   my $opcode = shift;
+   my $id = $self->id;
+   
+   if ($self->quant eq "*") {                  
+       my $nextid = $self->next()->id();
+       cry $id, "pushIndex $nextid";
+       cry $opcode;
+       cry "branch $id";
+ 
+   } elsif ($self->quant eq "+" ) {
+       my $nextid = $self->next()->id();
+       cry $id, $opcode;
+       cry "pushIndex $nextid";
+       cry "branch $id";       
+ 
+   } elsif ($self->quant eq "?" ) {
+       my $nextid = $self->next()->id();
+       cry $id, "pushIndex $nextid";
+       cry $opcode;  
+   }
+   else {
+       cry $id, $opcode;
+   }   
+ }
+ 
+ ##
+ ## each element pasm 
+ ##
+ 
+ 
+ 
+ sub BabyRegex::anchor::pasm {
+   my $self = shift;
+   my $type = $self->{TEXT};
+   print $type;
+ }
+ 
+ 
+ sub BabyRegex::macro::pasm { die "unimplemented\n"; }
+ 
+ 
+ sub BabyRegex::oct::explanation {  
+   die "unimplemented - too lazy\n";
+ }
+ 
+ sub BabyRegex::hex::explanation {
+   die "unimplemented - too lazy\n";
+ }
+ 
+ sub BabyRegex::utf8hex::explanation {
+   die "unimplemented - too lazy\n";
+ }
+ 
+ sub BabyRegex::ctrl::explanation {
+   die "unimplemented - too lazy\n";
+ }
+ 
+ sub BabyRegex::named::explanation {
+   die "unimplemented - too lazy\n";
+ }
+ 
+ sub BabyRegex::Cchar::explanation {
+   die "unimplemented - too lazy\n";
+ }
+ 
+ 
+ sub BabyRegex::any::pasm {
+   my $self = shift;  
+   my $l;
+   my $id = $self->id;
+     
+   $self->cry_atomic ("reAnything");
+ }
+ 
+ 
+ sub BabyRegex::text::pasm {
+   my $self = shift;
+   my $text = $self->text;
+   
+   $text =~ s/\n/\\n/g;
+   $text =~ s/\r/\\r/g;
+   $text =~ s/\t/\\t/g;
+   $text =~ s/\f/\\f/g;
+   $text =~ s/'/\\'/g;
+   
+   my $id = $self->id();
+ 
+   $self->cry_atomic ("reLiteral \"$text\"");
+   
+ }
+ 
+ 
+ sub BabyRegex::class::pasm {  
+   my $self = shift;
+   my $text = substr($self->text,1,-1);  
+   $self->cry_atomic ("reOneOf \"$text\""); 
+ }
+ 
+ 
+ sub BabyRegex::alt::pasm {
+   my $self = shift;
+   my $id = $self->id();
+   my $endofgroup_id = $self->{GROUP}->{CLOSED}->id;
+   
+   cry("branch $endofgroup_id");
+ 
+ }
+ 
+ 
+ sub BabyRegex::slash::pasm {  die "unimplemented\n"; }
+ 
+ 
+ sub BabyRegex::group::pasm{
+   my $self = shift;
+   
+   my $id = $self->id;
+   my $cnt = $self->{COUNT};  
+   my $fs = $self->fullstring;  
+   if (length($fs) > 30) {$fs = substr($fs,1,30) . "..."} 
+   
+   cry $id, "#start of n.c. group $cnt        $fs";
+ 
+   if ($self->quant eq "*" or $self->quant eq "?") {
+       cry "pushIndex ". $self->{CLOSED}->next->id();  
+   }
+  
+   foreach my $alt (@{$self->{ALTS}}) {
+       cry "pushIndex " . $alt->next->id();
+   }
+   
+ }
+ 
+ 
+ sub BabyRegex::capture::pasm {
+     
+   # Hei, we are not capturing yet! 
+   
+   my $self = shift;
+   my $id = $self->id;
+   my $cnt = $self->{COUNT};
+ 
+   my $fs = $self->fullstring;  
+   if (length($fs) > 30) {$fs = substr($fs,1,30) . "..."} 
+   
+   
+   if ($self->quant eq "*" or $self->quant eq "?") {
+       cry "pushIndex ". $self->{CLOSED}->next->id();  
+   }
+       
+   cry $id, "#start of group $cnt             $fs";
+       
+   foreach my $alt (@{$self->{ALTS}}) {
+       cry "pushIndex ". $alt->next->id();
+   }
+ }
+ 
+ 
+ sub BabyRegex::close::pasm {
+   my $self = shift;  
+   my $id = $self->id;
+   my $cnt = $self->{GROUP}->{COUNT};
+ 
+   cry $id, "#end of group $cnt";
+   
+   if ($self->{GROUP}->quant eq "*" or $self->{GROUP}->quant eq "+") {
+       cry "pushIndex " .  $self->next->id();
+       cry "branch " . $self->{GROUP}->id;
+       }
+ 
+   
+ }
+ 
+ 
+   
+ sub BabyRegex::comment::pasm { }
+ 
+ sub BabyRegex::whitespace::pasm{ }
+ 
+ 
+ sub BabyRegex::lookahead::explanation { die "unimplemented\n"; }
+ 
+ sub BabyRegex::lookbehind::explanation { die "unimplemented\n"; }
+ 
+ sub BabyRegex::code::pasm {  die "unimplemented\n"; }
+ 
+ sub BabyRegex::later::pasm { die "unimplemented\n"; }
+ 
+ sub BabyRegex::conditional::pasm { die "unimplemented\n"; }
+ 
+ sub BabyRegex::cut::pasm { die "unimplemented\n"; }
+ 
+ sub BabyRegex::flags::pasm{ die "unimplemented\n"; }
+ 
+ sub BabyRegex::backref::pasm { die "unimplemented \n"; }
+ 
+ 
+ 
+ 
+ 
+ 1;
+ 
+ __END__
+ 
+ =head1 NAME
+ 
+ BabyRegex - compiles a regular expression down to Parrot bytecode
+ 
+ =head1 SYNOPSIS
+ 
+   use BabyRegex;
+   BabyRegex->new($REx)->pasm;
+ 
+ =head1 SEE ALSO
+ 
+ The C<YAPE::Regex> documentation.
+ 
+ =head1 AUTHOR
+ 
+   Angel Faus
+   [EMAIL PROTECTED]
+   
+   Based in YAPE::Regex::Explain by Jeff Pinyan ([EMAIL PROTECTED])
+ 
+ =cut
+ 
+ 
diff -crN parrot_current/parrot/languages/regex/regexc.pl 
parrot_patched/parrot/languages/regex/regexc.pl
*** parrot_current/parrot/languages/regex/regexc.pl     Wed Dec 31 19:00:00 1969
--- parrot_patched/parrot/languages/regex/regexc.pl     Sat Nov 10 16:21:43 2001
***************
*** 0 ****
--- 1,10 ----
+ #!/usr/bin/perl
+ 
+ use BabyRegex;
+ 
+ die 'usage: perl regexc.pl "pattern" ' unless @ARGV[0];
+ $pattern = @ARGV[0];
+ 
+ $c = BabyRegex->new($pattern);
+ $c->pasm();
+ 
diff -crN parrot_current/parrot/t/op/re.t parrot_patched/parrot/t/op/re.t
*** parrot_current/parrot/t/op/re.t     Wed Dec 31 19:00:00 1969
--- parrot_patched/parrot/t/op/re.t     Sat Nov 10 16:09:24 2001
***************
*** 0 ****
--- 1,108 ----
+ #!perl -w
+ 
+ use Parrot::Test tests => 6;
+ 
+ output_is(<<'CODE', "1", "A is A");
+       reMatch RE, "A"
+       print I0
+       end
+ 
+ RE:
+         reOnFail $fail
+         reLiteral "A"
+       set I0, 1
+       reFinished
+ $fail:
+       set I0, 0
+       reFinished
+ 
+ CODE
+ 
+ output_is(<<'CODE', 0, "A is not B");
+         reMatch RE, "A"
+       print I0
+       end
+ 
+ RE:
+         reOnFail $fail
+         reLiteral "B"
+       set I0, 1
+         reFinished
+ $fail:
+       set I0, 0
+         reFinished
+ CODE
+ 
+ output_is(<<'CODE', 1, "advance-on-start works okay");
+       reMatch RE, "bab"
+       print I0
+       end
+ RE:
+         reOnFail $fail
+ $start:
+       rePushindex, $advance
+       reLiteral "a"
+       set I0, 1
+       reFinished
+ $fail:
+       set I0, 0
+       reFinished
+ $advance:
+       reAdvance
+         branch $start
+ 
+ CODE
+ 
+ output_is(<<'CODE', 0, "advance-on-start fails okay");
+       reMatch RE, "bxb"
+       print I0
+       end
+ 
+ RE:
+         reOnFail $fail
+ $start:
+       rePushindex, $advance
+       reLiteral "a"
+       set I0, 1
+       reFinished
+ $fail:
+       set I0, 0
+       reFinished
+ $advance:
+       reAdvance
+         branch $start
+ CODE
+ 
+ output_is(<<'CODE', "10", "dot works okay");
+       reMatch RE, "a"
+       print I0
+       reMatch RE, "\n"
+       print I0
+       end
+ 
+ RE:
+         reOnFail $fail
+         reAnything
+       set I0, 1
+       reFinished
+ $fail:
+       set I0, 0
+       reFinished
+ CODE
+ 
+ output_is(<<'CODE', "10", "character class works okay");
+       reMatch RE, "a"
+       print I0
+       reMatch RE, "z"
+       print I0
+       end
+ 
+ RE:
+         reOnfail $fail
+         reOneof "abc"
+       set I0, 1
+       reFinished
+ $fail:
+       set I0, 0
+       reFinished
+ CODE

Re: A serious stab at regexes

Reply via email to