Re: html sanitizers

Horst Gutmann Fri, 13 Jul 2007 05:07:39 -0700

Brett Parker wrote:
> On Fri, Jul 13, 2007 at 11:48:50AM +0100, Nic James Ferrier wrote:
>> Brett Parker <[EMAIL PROTECTED]> writes:
>>
>>> On Fri, Jul 13, 2007 at 11:18:18AM +0100, Nic James Ferrier wrote:
>>>> Derek Anderson <[EMAIL PROTECTED]> writes:
>>>>
>>>>> hey all,
>>>>>
>>>>> could anyone point me to a python html sanitizer implementation (or 
>>>>> example)?  i don't mean to strip all html, just tags and attributes not 
>>>>> on a whitelist, such as I/B/A href/U/etc.
>>>> I use libxml2/libxslt, something like:
>>>>
>>>>   doc = libxml2.htmlParseDoc(whatever, "utf8")
>>>>   result = libxslt.applyStylesheetFile(doc, "strip.xslt", {})
>>>>
>>>> There are loads of ways of stripping in xslt depending on what you
>>>> want to do.
>>> Only works on well formed XHTML documents though... which although they
>>> should be the norm, really aren't!
>> No. In my example I deliberately used libxml2's HTML parser, which is an
>> HTML parser, not an XHTML parser.
>>
>> It copes with non-well formed documents as well as all the usual
>> entity problems.
> 
> Ohhh, so you did - sorry - eyes still blurry from sleep deprivation!
> 
> Cheers,


I used BeautifulSoup_ for this job. It is also a generic HTML parser, and
it provides some easy ways to remove certain HTML elements from the tree
while keeping others.

The problem with whatever system you use is the huge number of ways to
inject XSS attacks into HTML, thanks to quirks in the various browser
engines. A few weeks ago I found a nice listing (a nice looooong listing)
but I can't remember the URL anymore :-/
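
A minimal sketch of the whitelist idea follows. It uses only Python's
standard-library ``html.parser`` rather than BeautifulSoup, so it is not the
code referred to above. The whitelist (modelled on the "I/B/A href/U" example
from the original question), the allowed URL schemes, and the names
``WhitelistSanitizer`` and ``sanitize`` are all assumptions made for
illustration. Given the number of browser quirks, treat it as a starting
point and not as a complete XSS defence.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical whitelist: tag name -> set of attributes allowed on that tag.
ALLOWED_TAGS = {"i": set(), "b": set(), "u": set(), "a": {"href"}}
# Hypothetical scheme whitelist; "" means a relative URL with no scheme.
ALLOWED_SCHEMES = {"http", "https", "mailto", ""}


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still emitted
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue
            if name == "href":
                # Remove whitespace and control characters before checking,
                # so values like "java\tscript:..." are not mistaken for
                # relative URLs.
                cleaned = "".join(ch for ch in value if ch > " ")
                if urlparse(cleaned).scheme.lower() not in ALLOWED_SCHEMES:
                    continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text (including the body of a dropped <script>) is escaped,
        # so it can no longer be interpreted as markup.
        self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

For example, ``sanitize('<b onclick="x()">hi</b>')`` returns ``<b>hi</b>``.
The sketch does not balance unclosed tags, which a tree-based parser such as
BeautifulSoup would handle for you.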

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

