On Apr 9, 2:38 pm, Michel Bouwmans <[EMAIL PROTECTED]> wrote:
> Hey everyone,
>
> I'm trying to strip all script blocks from an HTML file using regex.
>
> I tried the following in Python:
>
> testfile = open('testfile')
> testhtml = testfile.read()
> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
> result = regex.sub('', blaat)
> print result
>
> This strips far more than just the script blocks. Am I missing
> something in Python's regex implementation, or am I doing something
> else wrong?
>
> greetz
> MFB

This pyparsing-based HTML stripper
(http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py)
strips *all* HTML tags, scripts, and comments. To pare down to just
stripping scripts, just change this line:

    firstPass = (htmlComment | scriptBody | commonHTMLEntity |
                 anyTag | anyClose).transformString(targetHTML)

to:

    firstPass = scriptBody.transformString(targetHTML)

-- Paul

--
http://mail.python.org/mailman/listinfo/python-list
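For comparison, here is a minimal regex-only sketch of the approach in the quoted question. It is not the pyparsing stripper linked above. It assumes two problems in the snippet as posted: the pattern is not a raw string, so Python reads '\b' as a backspace character rather than a word boundary, and sub() is called on 'blaat' instead of 'testhtml'. The names SCRIPT_RE, strip_scripts and sample are made up for illustration, and a regex like this can still be fooled by unusual markup (for example, a script tag inside an HTML comment).

```python
import re

# Raw string (r'...') so that \b reaches the regex engine as a word boundary.
# DOTALL lets '.' span newlines inside the script body; IGNORECASE also
# catches <SCRIPT> written in upper case.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                       re.DOTALL | re.IGNORECASE)

def strip_scripts(html):
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_RE.sub('', html)

# Hypothetical sample input standing in for the contents of 'testfile'.
sample = ('<p>before</p><script type="text/javascript">\n'
          'alert("x");\n</script><p>after</p>')
print(strip_scripts(sample))
```

The non-greedy '.*?' stops at the first closing tag, so text between two separate script blocks is kept.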
