On Apr 9, 2:38 pm, Michel Bouwmans <[EMAIL PROTECTED]> wrote:
> Hey everyone,
>
> I'm trying to strip all script-blocks from a HTML-file using regex.
>
> I tried the following in Python:
>
> testfile = open('testfile')
> testhtml = testfile.read()
> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
> result = regex.sub('', blaat)
> print result
>
> This strips far more away then just the script-blocks. Am I missing
> something from the regex-implementation from Python or am I doing something
> else wrong?
>
> greetz
> MFB
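Two details in the quoted snippet are worth checking before switching tools. In a plain (non-raw) Python string literal, '\b' is a backspace character rather than a regex word boundary, and the snippet reads the file into testhtml but passes blaat to sub(). What follows is only a minimal sketch of a regex-only version, assuming the pattern was meant to use a word boundary. The names SCRIPT_BLOCK, strip_scripts and sample are made up for illustration, and the IGNORECASE flag and the \s* in the closing tag are additions that are not in the original post.

```python
import re

# Raw string, so \b reaches the regex engine as a word boundary
# instead of being turned into a backspace character by Python.
# IGNORECASE and \s* before the final '>' are assumptions added here.
SCRIPT_BLOCK = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                          re.DOTALL | re.IGNORECASE)


def strip_scripts(html):
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_BLOCK.sub('', html)


# Hypothetical input, not taken from the original poster's file.
sample = ('<p>before</p>'
          '<script type="text/javascript">var a = "<b>";\nalert(a);</script>'
          '<p>after</p>')

print(strip_scripts(sample))
```

A regex like this still trips over input such as a literal "</script>" inside a JavaScript string, which is one reason to prefer a parser-based approach.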
This pyparsing-based HTML stripper
(http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py)
strips *all* HTML tags, scripts, and comments. To pare it down to
stripping only scripts, change this line:

    firstPass = (htmlComment | scriptBody | commonHTMLEntity |
                 anyTag | anyClose).transformString(targetHTML)

to:

    firstPass = scriptBody.transformString(targetHTML)

-- Paul