Re: Stripping scripts from HTML with regular expressions

Paul McGuire Thu, 10 Apr 2008 08:28:28 -0700

On Apr 9, 2:38 pm, Michel Bouwmans <[EMAIL PROTECTED]> wrote:
> Hey everyone,
>
> I'm trying to strip all script-blocks from a HTML-file using regex.
>
> I tried the following in Python:
>
> testfile = open('testfile')
> testhtml = testfile.read()
> regex = re.compile('<script\b[^>]*>(.*?)</script>', re.DOTALL)
> result = regex.sub('', blaat)
> print result
>
> This strips far more away than just the script-blocks. Am I missing
> something in the regex-implementation of Python or am I doing something
> else wrong?
>
> greetz
> MFB


This pyparsing-based HTML stripper
(http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py)
strips *all* HTML tags, scripts, and comments.  To strip only the
scripts, change this line:

firstPass = (htmlComment | scriptBody | commonHTMLEntity |
             anyTag | anyClose ).transformString(targetHTML)

to:

firstPass = scriptBody.transformString(targetHTML)
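
If you would rather stay with the re module, here is a minimal sketch
of a regex-only version.  It is not taken from htmlStripper.py; the
names SCRIPT_RE and strip_scripts are made up for illustration.  Two
things in the posted snippet are worth checking: the pattern is not a
raw string, so '\b' reaches the regex engine as a backspace character
rather than a word boundary, and sub() is called on `blaat` instead of
`testhtml`.  A regex like this will still be fooled by a literal
"</script>" inside a JavaScript string, which is one reason to prefer
a real parser.

```python
import re

# Raw string: \b stays a regex word boundary instead of becoming a
# backspace character. IGNORECASE also catches <SCRIPT>; DOTALL lets
# .*? span the newlines inside a script block.
SCRIPT_RE = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                       re.DOTALL | re.IGNORECASE)

def strip_scripts(html):
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_RE.sub('', html)

if __name__ == '__main__':
    sample = ('<p>before</p>'
              '<SCRIPT type="text/javascript">var a = "<b>";\n</SCRIPT>'
              '<p>after</p>')
    print(strip_scripts(sample))
```

The non-greedy .*? stops at the first closing tag, so two script
blocks on the same page are removed separately and the markup between
them is kept.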

-- Paul
-- 
http://mail.python.org/mailman/listinfo/python-list