Hi Folks,

I'm trying to strip C/C++ style comments (/* ... */ or //) from
source code using Python regexps.

If I don't have to worry about comments embedded in strings, it seems
pretty straightforward (this is what I'm using now):

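import re

cpp_pat = re.compile(r"""
/\* .*? \*/ |                # C comments
// [^\n\r]*                  # C++ comments
""", re.S | re.X)
with open('myprog.cpp') as f:    # read the entire file at once
    s = f.read()
s = cpp_pat.sub(' ', s)          # sub() returns a new string, so keep it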

However, the sticking point is dealing with tokens like /* embedded
within a string:

const char *mystr =  "This is /*trouble*/";
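
To show what goes wrong, here is a small self-contained check of the
naive pattern against that line. The variable names line and stripped
are just placeholders for this illustration:

```python
import re

# Same naive pattern as above: it knows nothing about string literals.
cpp_pat = re.compile(r"""
/\* .*? \*/ |                # C comments
// [^\n\r]*                  # C++ comments
""", re.S | re.X)

line = 'const char *mystr =  "This is /*trouble*/";'

# The /*trouble*/ inside the string literal is treated as a comment
# and replaced with a space, which silently changes the program.
stripped = cpp_pat.sub(' ', line)
print(stripped)
```

The text between the quotes is altered even though it was never a
comment.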

I've inherited a working Perl script, which I'd like to reimplement in
Python so that I don't have to spawn a new Perl process in my Python
program each time I want to strip comments from a file. The Perl script
looks like this:

#!/usr/bin/perl -w

$/ = undef;                     # no line delimiter
$_ = <>;                        # read entire file

s! ((['"]) (?: \\. | .)*? \2) | # skip quoted strings
   /\* .*? \*/ |                # delete C comments
   // [^\n\r]*                  # delete C++ comments
 ! $1 || ' '                    # change comments to a single space
 !xseg;                         # ignore white space, treat as single line
                                # evaluate result, repeat globally
print;

The Perl substitution above deals with this through the /e flag: the
replacement, $1 || ' ', is evaluated as an expression, so a quoted
string is replaced with itself and a comment is replaced with a single
space. Is there some equivalent feature in Python regexps?
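
To make the goal concrete, here is a rough sketch of what I imagine a
Python version could look like, assuming that passing a function as the
replacement argument of sub() can stand in for Perl's evaluated
replacement. The names strip_pat, keep_strings and strip_comments are
my own placeholders, and I have only tried it on toy input:

```python
import re

# Direct transliteration of the Perl pattern above.
strip_pat = re.compile(r"""
( (['"]) (?: \\. | . )*? \2 ) |   # group 1: a quoted string, kept as is
/\* .*? \*/ |                     # C comment
// [^\n\r]*                       # C++ comment
""", re.S | re.X)

def keep_strings(match):
    # Mirrors the Perl expression $1 || ' ': group 1 is None when the
    # match was a comment, so the comment becomes a single space.
    return match.group(1) or ' '

def strip_comments(source):
    # sub() calls keep_strings once per match and uses its return value.
    return strip_pat.sub(keep_strings, source)

sample = 'const char *mystr =  "This is /*trouble*/"; /* real */ // tail\n'
print(strip_comments(sample))
```

This is only a sketch of the idea, not a full C/C++ lexer, so
constructs the Perl pattern does not handle are not handled here
either.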

Lorin

-- 
http://mail.python.org/mailman/listinfo/python-list
