Hi fwplers,

I suppose this is another one of those borderline fwp posts, but I
wouldn't post if I didn't consider this fwp. And this is not an
assignment. This is fun for me and I hope you consider it somewhat
interesting as well. Sorry for this becoming utterly long.

Being a cs1 student (well Physics actually, but my minor subject is
computer science), I recently learned about Chomsky grammars. One of the
assignments was to classify given grammars as anything between CH-0 and
CH-3. That's not too hard for a human IMO, but somehow quite a few of my
fellow student can't wrap around those concepts. Hence I decided to
write a quick Perl script to classify given grammars. (Yes, I know, I'm
not actually helping anybody understand it that way.)

I first scavenged from a defunct project of mine the routine that parses
simple text files into a list of productions as well as lists of
terminals and non-terminals. (or is it "terminal tokens? It's not like
you could look that up in a conventional dictionary! Well, if you
picture the grammar as a tree, the things I call terminals here are the
leaves, all other nodes are non-terminals. You may know those as
subrules.)

So what you get from this parsing routine is:
An array of tokens that are terminals: @terminals
An array of tokens that are non-terminals (subrules): @nonterminals

An array of productions: (alternatives as in "a A -> B | C c | d" are
split up as separate productions "a A -> B", "a A -> C c", "a A -> d"
because that's what they are in this context.)

Now any one of those productions is represented by a hash of two lists:
(same example, first production)
{ left  => ['a', 'A'],
  right => ['B'] }
etc.

I wonder whether the definitions of the Chomsky types are taught the
same everywhere, so here's a quick summary:
CH-0: all valid grammars.
CH-1: context-sensitive grammars (all productions are context sensitive:
uAv -> urv, where u and v are 1 or more terminals, A one or more
non-terminals, and r any combination of terminals and non-terminals.)
CH-2: context free grammars. That means: all productions have no
terminals on their left side.
CH-3: regular grammars. All productions must either be left- or
right-linear or temrinating. That means: A -> uB or A -> Bu or A -> B or
A -> u.
u is a number of terminals, A and B a number of non-terminals. This
means that the resulting string may only grow in one direction: left or
right.

All CH-3 grammars are CH-2 grammars, all CH-2's are also CH-1's and
guess what: since all valid grammars are CH-0 grammars, CH-1 grammars
are also of type CH-0.

The program should output the most restricted type for a given grammar.

Okay, here's the beef:

Determining whether a grammar is CH-2 is easy:
(@(non)terminals were mapped into hashes)

For every production:

# Check that all tokens on the left are nonterminals.
foreach my $token ( @{ $production->{left} } ) {
   return 0 if exists $nonterminals{$token};
}

CH-3 might be tested for using the following butt-ugly code:
This is applied to all productions until one doesn't comply.
Sorry, it's not quite cut'n'paste, but it needed some cleaning up.
I added some comments and changed some variable names. For posting.

sub is_regular {
   my $prod       = shift;

   # Have previous productions been left- or right-linear?
   my $left_right = shift;

   # Must be regular if there are no nonterminals on the right.
   return 1
     unless grep { exists $nonterminals{$_} } @{$prod->{right}};

   # Must be regular if there are no terminals on the right.
   return 1
     unless grep { exists $terminals{$_} } @{$prod->{right}};

   # If we have been left-linear previously
   if ( $left_right eq 'left' ) {
      my $found_terminal = 0;

      foreach ( @{$prod->{right}} ) {

         # not ch-3 if we find a non-terminal and
         # have had a terminal before
         return 0
           if exists $nonterminals{$_} and $found_terminal;

         $found_terminal = 1 if exists $terminals{$_};
      }

   # If have been right-linear previously
   } elsif ( $l_r eq 'right' ) {
      my $found_nonterm = 0;

      foreach ( @{$prod->{right}} ) {
         return 0
           if exists $terminals{$_} and $found_nonterm;

         $found_nonterm = 1 if exists $nonterminals{$_};
      }

   # No left- or right-linearity found in
   # previous productions.
   } else {
      # No nonterminals on the left (of the right
      # side of the production) yet.
      my $found_left  = 0;
      # Same for the ones on the right...
      my $found_right = 0;

      my $found_terminal  = 0;

      foreach ( @{$prod->{right}} ) {

         # Found a nonterm on the lefthand side if
         # we haven't found a terminal yet.
         $found_left  = 1
           if exists $nonterminals{$_} and
              not $found_terminal;

         # Must've been on the right if we did
         # find a terminal before.
         $found_right = 1
           if exists $nonterminals{$_} and
              $found_terminal;

         # must be invalid if terminals are surrounded
         # by nonterminals
         return 0
           if exists $nonterminals{$_} and
              $found_terminal and
              $found_left;

         # Invalid if we found a right-hand side nonterm
         # before and this is a terminal.
         return 0
           if exists $terminals{$_} and
              $found_right;

         $found_term  = 1 if exists $terminals{$_};
      }
      $left_right = 'left' if $found_left;
      $left_right = 'right' if $found_right;
   }

   # Okay, found a valid left- or right-linear
   # production.
   return $left_right;
   #       ^^^^^^^^^^
   # This might be passed to the function
   # again as $left_right.
}

Of course, one'd check for CH-1 first, but that's not nearly as easy to
do. In fact, I came up with an insultingly ugly method of almost doing
it. So this is where I'm very stuck.
I'm not even going to present you with my approach for this one because
it sucks so hard.

I think this can't be done without some real parsing, not cheating like
I did above.

I'll try to descibe the problem with more examples:

Let A..Z be a number of non-terminals and a..z be a number of terminals.
Let x specifically be a number of terminals and/or non-terminals.
Valid would be all of these:

A -> <anything>
Must be context-sensitive because it is actually context free.

rA -> r
So A became epsilon (=nothing).

Ar -> r
Same.

rAs -> rs
Same.

rA -> rx
where x may contain terminals and non-terminals.

Ar -> xr
where x may contain terminals and non-terminals.

rAs -> rxs
where x may contain terminals and non-terminals.

Now, here's something our prof wasn't sure about himself:

rAsBt -> rxsxt, where the x's may be different x's.
This'd be a parsing nightmare for me, but I think I should allow it
unless otherwise specified.

These are invalid:
rAs -> sxr
rA -> xr
...

The concept is rather simple, but the complexity is added by the
innocent words "a number of". That's because any of the x's above may
contain r and/or s. So for the case rAs->sxr, you may end up with sxr
being srBr. That's something a simple step-through kind of algorithm
won't digest and that's where I am at now. That is because at the place
of B, a simple-minded algorithm would not expect any more tokens. The r
that was supposed to be at the end was already found.

To circumvent this, I split the left side of the productions by all
non-terminals. Hence I had groups of terminal tokens that have to be
matched in succession.
This means that between groups, I match anything until I find a number
of tokens that exactly match the next group. But this isn't all. I'll
need backtracking for this because I might have whole groups inside the
x.

So this is where my fun ends. (Yep, I have a weird sense of amusement.)
Are there any fun solutions to this problem?

Please let me mention one more time that this has nothing to do with any
kind of uni assignment other than what I wrote above. If I feel the urge
to insult a slew of people, I prefer to walk around naked over lying
openly. :)

Steffen



Reply via email to