C Code Parser Using Recursive Descent

Rahul Jain Fri, 09 Oct 2009 05:12:40 -0700

Hi All,
 
I am working on a C code parser. One of my requirements is to parse C source 
and header files and calculate the lines of code (LOC). There are tools that 
do this, but all of them share a problem: they count a function definition or 
declaration whose argument list spans multiple lines as multiple lines rather 
than as a single line. For example:
 
- 1 -
 
/* Definition */
int Test ( int x,
              int y
           )
{
}
 
or
 
/* Declaration */ 
int Test( int x,
             int y
           );
 
are counted as 5 and 3 lines respectively, rather than 3 and 1, i.e. they 
should be treated as:
 
- 2 -
 
/* Definition */
int Test ( int x, int y)
{
}
 

/* Declaration */ 
int Test( int x, int y );
 
To fix this I plan to modify my Perl tool. I want to use Parse::RecDescent to 
parse the input file, identify such constructs, and then have the Perl script 
convert each multi-line construct into a single-line one, so that if 
construct - 1 - is given as input to the script, the output is - 2 -. I found 
a script by Damian Conway, Helmut Jarausch and Teodor Zlatanov (also attached 
to this mail) which uses Parse::RecDescent to separate comments from the C 
code. The grammar it uses is:
 

C_code : m{( 
                      [^"/]+      # one or more non-delimiters
                      (            # then (optionally)...
                      /            # a potential comment delimiter
                      [^*/]       # which is not an actual delimiter
                      )?          # 
                     )+          # all repeated once or more
                 }x
                 { $Code .= $item[1] }
 
comment : m{   \s*              # optional whitespace
                       //                # comment delimiter
                       [^\n]*           # anything except a newline
                       \n               # then a newline
                    }x
                    {  $Code .= "\n"; $Comments .= $item[1] }
 
                    |   m{\s*                   # optional whitespace
                          /\*                     # comment opener
                          (?:[^*]+|\*(?!/))*   # anything except */
                          \*/                     # comment closer
                          ([ \t]*)?              # trailing blanks or tabs
                       }x 
                    { $Code .= " "; $Comments .= $item[1] }
 
I want to use the same methodology, but rather than separating the comments 
from the C code I want a grammar that identifies such multi-line constructs 
and, wherever one is found, converts it into the required single-line form.
 
So could someone please help me with a grammar that can identify these 
constructs, and with the way to convert each one into a single line?
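
As a possible starting point, here is a minimal sketch of the joining step on 
its own. It does not use Parse::RecDescent or a real C grammar. It swaps in a 
simpler technique: scan the text token by token, track parenthesis depth, and 
replace any whitespace run containing a newline with one space while inside 
parentheses. The name join_paren_lines is hypothetical and not part of the 
attached script. The sketch assumes that collapsing every multi-line 
parenthesised group is acceptable, which also affects multi-line calls and 
if (...) conditions, not only definitions and declarations.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the attached script):
# put every parenthesised group that spans several lines onto one line.
sub join_paren_lines {
    my ($src) = @_;
    my ($out, $depth, $prev_line_comment) = ('', 0, 0);

    # Tokens: block comment, line comment, string literal, character
    # literal, whitespace run, or any other single character.  Comments
    # and literals are kept whole so parentheses inside them are ignored.
    while ($src =~ m{\G( /\*.*?\*/
                       | //[^\n]*
                       | "(?:\\.|[^"\\])*"
                       | '(?:\\.|[^'\\])*'
                       | \s+
                       | .
                       )}gcxs) {
        my $tok = $1;
        if    ($tok eq '(') { $depth++ }
        elsif ($tok eq ')') { $depth-- if $depth > 0 }
        elsif ($depth > 0 && $tok =~ /\A\s*\n\s*\z/ && !$prev_line_comment) {
            # Line break inside parentheses: collapse to one space.
            # The newline that ends a // comment is left alone so the
            # following code is not swallowed by the comment.
            $tok = ' ';
        }
        $prev_line_comment = ($tok =~ m{\A//}) ? 1 : 0;
        $out .= $tok;
    }
    return $out;
}

# Usage: read a whole file, join, then count the remaining lines.
# my $text   = do { local $/; <> };
# my $joined = join_paren_lines($text);
# my $lines  = () = $joined =~ /\n/g;
```

With construct - 1 - as input, the declaration comes out as a single line and 
the definition as three lines. Multi-line block comments inside an argument 
list are left untouched, so running the comment-stripping grammar first is 
probably still worthwhile.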
 
Thanks in advance,
 
Regards
Rahul Jain
HCL Technologies Ltd.

#! /usr/bin/perl -w
# stat-comments.pl by Teodor Zlatanov, t...@iglou.com
# March 26, 2000

# A script to evaluate the readability of comments
# embedded in C++.  Utilizes code from demo-decomment.pl,
# which is included with the Parse::RecDescent module.
# Uses the Lingua::EN::Fathom module to evaluate text
# readability.

# ORIGINAL BY Helmut Jarausch 
# EXTENDED BY Damian Conway AND Helmut Jarausch
# POLISHED BY Teodor Zlatanov


use strict;
use Parse::RecDescent;
use Lingua::EN::Fathom;

use vars qw/ $Grammar /;

my $parser = new Parse::RecDescent $Grammar  or  die "invalid grammar";

undef $/;
my $text = @ARGV ? <> : <DATA>;

my $parts = $parser->program($text) or die "malformed C program";

# only work with comments of length > 0
die "No comments found in input" unless length $parts->{comments};

# convert every comment mark to a period, so separate comments are
# separate sentences, if well-formed.  Lingua::EN::Fathom is quite
# good at figuring out what sentences are valid, so an extra period
# in the text won't affect the overall counts.

$parts->{comments} =~ s#//#. #g;
$parts->{comments} =~ s#/\*#. #g;
$parts->{comments} =~ s#\*/#. #g;

# we can now evaluate the comments (stored in $parts->{comments})
my $fathom = new Lingua::EN::Fathom; 
$fathom->analyse_block($parts->{comments});

# voila, the readability report!
print($fathom->report);
  
BEGIN
{ $Grammar=<<'EOF';

program : <rulevar: local $WithinComment=0>
program : <rulevar: local $Comments = ""> /this shouldn't be here :-/
program : <reject>
program : <reject> /with prejudice/
program : <rulevar: local $Code = "">
program : <rulevar: local @Strings>

program : <skip:''> part(s)
                { { code=>$Code, comments=>$Comments, strings=>[@Strings]} }

part    : comment
        | C_code
        | string

C_code  : m{(                   
              [^"/]+            # one or more non-delimiters
              (                 # then (optionally)...
               /                # a potential comment delimiter
               [^*/]            # which is not an actual delimiter
              )?                # 
            )+                  # all repeated once or more
           }x
                { $Code .= $item[1] }

string  : m{"                   # a leading delimiter
            ((                  # zero or more...
              \\.               # escaped anything
              |                 # or
              [^"]              # anything but a delimiter
             )*
            )
            "}x
                { $Code .= $item[1]; push @Strings, $1 }


comment : m{\s*                 # optional whitespace
            //                  # comment delimiter
            [^\n]*              # anything except a newline
            \n                  # then a newline
           }x
                { $Code .= "\n"; $Comments .= $item[1] }

        | m{\s*                 # optional whitespace
            /\*                 # comment opener
            (?:[^*]+|\*(?!/))*  # anything except */
            \*/                 # comment closer
            ([ \t]*)?           # trailing blanks or tabs
           }x   
                { $Code .= " "; $Comments .= $item[1] }

EOF
}
__DATA__
program test; // for decomment

// using Parse::RecDescent

/*
 We should raise the indices quite a bit with this text section,
 because it will actually include sentences and structure.  See,
 the problem with most C/C++ programs is that they use comments
 that are very short and convey little information.
*/
 
int main()
{
/* this should
   be removed
*/
  char *cp1 = "";
  char *cp2 = "cp2";
  int i;  // a counter
          // remove this line altogether
  int k;  
      int more_indented;  // keep indentation
      int l;  /* a loop
             variable */
      // should be completely removed

  char *str = "/* ceci n'est pas un commentaire */";
  return 0;
}
