Hi Folks, I'm trying to strip C/C++ style comments (/* ... */ or // ...) from source code using Python regexps.
If I don't have to worry about comments embedded in strings, it seems pretty straightforward (this is what I'm using now):

    import re

    cpp_pat = re.compile(r"""
        /\* .*? \*/   |   # C comments
        // [^\n\r]*       # C++ comments
        """, re.S | re.X)

    # Read the whole file, then replace each comment with a single space
    with open('myprog.cpp') as f:
        s = f.read()
    s = cpp_pat.sub(' ', s)

However, the sticking point is dealing with tokens like /* embedded within a string:

    const char *mystr = "This is /*trouble*/";

I've inherited a working Perl script, which I'd like to reimplement in Python so that I don't have to spawn a new Perl process in my Python program each time I want to strip comments from a file. The Perl script looks like this:

    #!/usr/bin/perl -w

    $/ = undef;   # no line delimiter
    $_ = <>;      # read entire file

    s! ((['"]) (?: \\. | .)*? \2) |   # skip quoted strings
       /\* .*? \*/ |                  # delete C comments
       // [^\n\r]*                    # delete C++ comments
     ! $1 || ' '                      # change comments to a single space
     !xseg;                           # ignore white space, treat as single line
                                      # evaluate result, repeat globally

    print;

The Perl substitution above deals with this by evaluating the replacement as an expression ($1 || ' '): if the match is a quoted string (captured in $1), it is replaced with itself; otherwise the match is a comment and is replaced with a single space. Is there some equivalent feature in Python regexps?

Lorin
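For reference, here is a minimal sketch of how the Perl approach might carry over to Python. It relies on re.sub accepting a function as the replacement, which then plays the role of Perl's `$1 || ' '` expression. The names `strip_comments` and `_replacer` are hypothetical, chosen only for illustration, and the pattern is a direct transcription of the Perl one rather than a full C/C++ lexer.

```python
import re

# Same alternatives as the Perl pattern; group 1 captures a quoted
# string (or character literal) so it can be put back unchanged.
cpp_pat = re.compile(r"""
    ( (['"]) (?: \\. | . )*? \2 )   |   # group 1: quoted string, kept as-is
    /\* .*? \*/                     |   # C comment
    // [^\n\r]*                         # C++ comment
    """, re.S | re.X)


def _replacer(match):
    # Hypothetical helper mirroring Perl's `$1 || ' '`:
    # group(1) is None when a comment alternative matched.
    return match.group(1) or ' '


def strip_comments(source):
    # Hypothetical wrapper: replace each comment with a single space,
    # leaving quoted strings untouched.
    return cpp_pat.sub(_replacer, source)
```

Called on the troublesome line, `strip_comments('const char *mystr = "This is /*trouble*/";')` should return the input unchanged, because the quoted-string alternative matches first and is substituted with itself. Like the Perl original, this sketch does not attempt to handle every corner of C/C++ lexing (for example, line continuations inside // comments).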