Brett Parker wrote:
> On Fri, Jul 13, 2007 at 11:48:50AM +0100, Nic James Ferrier wrote:
>> Brett Parker <[EMAIL PROTECTED]> writes:
>>
>>> On Fri, Jul 13, 2007 at 11:18:18AM +0100, Nic James Ferrier wrote:
>>>> Derek Anderson <[EMAIL PROTECTED]> writes:
>>>>
>>>>> hey all,
>>>>>
>>>>> could anyone point me to a python html sanitizer implementation (or
>>>>> example)? i don't mean to strip all html, just tags and attributes not
>>>>> on a whitelist, such as I/B/A href/U/etc.
>>>> I use libxml2/libxslt, something like:
>>>>
>>>>   doc = libxml2.htmlParseDoc(whatever, "utf8")
>>>>   result = libxslt.applyStylesheetFile(doc, "strip.xslt", {})
>>>>
>>>> There are loads of ways of stripping in xslt depending on what you
>>>> want to do.
>>> Only works on well formed XHTML documents though... which although they
>>> should be the norm, really aren't!
>> No. In my example I deliberately used libxml2' HTML parser which is an
>> HTML parser not an XHTML parser.
>>
>> It copes with non-well formed documents as well as all the usual
>> entity problems.
>
> Ohhh, so you did - sorry - eyes still blurry from sleep deprivation!
>
> Cheers,

I used BeautifulSoup_ for this job. It is also a generic HTML parser, and it
provides some easy ways to remove certain HTML elements from the tree while
keeping others.

The problem with whatever system you use is the huge number of ways to
inject XSS attacks into HTML, thanks to problems in the various browser
engines. A few weeks ago I found a nice listing (nice looooong listing) but
can't remember the URL anymore :-/

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/django-users?hl=en
-~----------~----~----~----~------~----~------~--~---
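
For illustration, here is a rough sketch of the whitelist idea from the
original question (keep I/B/U and A href, drop everything else). To stay
dependency-free it uses the standard library's html.parser instead of
BeautifulSoup or libxml2, so it is not the code either of us actually runs.
The names ALLOWED, SAFE_SCHEMES, WhitelistSanitizer and sanitize are made up
for this example, and the list of allowed URL schemes is an assumption.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attributes allowed on that tag.
ALLOWED = {"i": set(), "b": set(), "u": set(), "a": {"href"}}
# Assumption: only these URL schemes are acceptable in href values.
SAFE_SCHEMES = ("http://", "https://", "mailto:")


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED:
            return  # drop the tag itself, keep its text content
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            # Drop javascript: and friends; relative URLs are dropped too.
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # The parser hands over unescaped text, so escape it again.
            self.out.append(escape(data, quote=False))


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


print(sanitize('<b onclick="x()">hi</b> <a href="javascript:alert(1)">x</a>'))
```

This only shows the shape of the approach. It does not balance tags (a stray
closing tag on the whitelist is passed through), and given how many XSS
tricks exist, a sketch like this should not be treated as a complete defence.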