Hi again. So after much reading and consideration, I decided I'd like
to try html5lib. Unfortunately, it seems that the current version,
html5lib-0.11.1.zip (http://code.google.com/p/html5lib/downloads/list),
uses setuptools, and the current setuptools installer for Windows is
setuptools-0.6c9.win32-py2.5.exe (http://pypi.python.org/pypi/setuptools),
which targets Python 2.5. But I'm running Python 2.6. I haven't had much
experience with adding packages to Python, so if there's a workaround for
this (aside from using Python 2.5), could somebody please share it?
Thanks!

On Feb 24, 10:09 am, Michael Repucci <[email protected]> wrote:
> Hi Everybody, I just wanted to give a HUGE THANKS to everyone
> participating in this discussion. It's exactly the kind of information
> I was hoping I could get from all of you; more than enough to keep a
> newbie like me occupied on the topic for quite some time. So thanks
> again for sharing your knowledge and experience. :)
>
> On Feb 24, 9:14 am, Brian Neal <[email protected]> wrote:
>
> > On Feb 23, 10:51 pm, Jacob Kaplan-Moss <[email protected]>
> > wrote:
>
> > > On Mon, Feb 23, 2009 at 7:49 PM, Brian Neal <[email protected]> wrote:
> > > > Interesting, I've also come across this:
>
> > > > > http://codespeak.net/lxml/lxmlhtml.html#cleaning-up-html
>
> > > > > I've heard it is very fast, as it is just a Python binding to a
> > > > > C library...?
>
> > > Short version: don't use lxml.html.clean, either.
>
> > > Long version: yes, lxml is built on top of libxml2 so it is indeed
> > > *very* fast. Probably as much as an order of magnitude faster than
> > > html5lib.
>
> > > However, if you look at the source of lxml.html.clean
> > > (http://codespeak.net/lxml/api/lxml.html.clean-pysrc.html) you'll see
> > > > it's implemented in terms of a blacklist. This is almost always a bad
> > > idea: you only have to miss *one thing* on your blacklist to make your
> > > site as insecure as if you'd not bothered escaping HTML at all. IOW,
> > > with a blacklist you'd be on constant defense. Remember how early spam
> > > > protection systems just blocked spammers' email addresses? How'd that
> > > work out, anyway?
>
> > > > Also... the FIXMEs in that code don't exactly inspire confidence.
>
> > > > No knock against lxml here -- it's an incredible toolkit, and I use it
> > > > all over the place for general XML and HTML parsing. But security is
> > > *hard* stuff; it's worth being paranoid about your tools.
>
> > I did start to use lxml.html.clean on my project. I tested it (very
> > casually) and it seemed to work just fine. However, in one spot in my
> > code I needed finer control over what tags to allow. According to the
> > docs and the options, it looked to me like you could operate it with a
> > white list. However this didn't work out in practice. The options you
> > give to the cleaner are confusing and seem to contradict each other. I
> > couldn't get it to do what I wanted. I asked about this on the mailing
> > > list and it was conceded that the options didn't work together very
> > well. I also studied the source code a bit and came to the same
> > conclusion.
>
> > I then turned to using Markdown and recalibrated my opinion on what to
> > allow as input from the user.
>
> > Thanks for the link to html5lib though. I will keep that in my back
> > pocket.
>
> > BN
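
As an aside on the blacklist-versus-whitelist point quoted above, here is a
minimal Python 3 sketch using only the standard library's html.parser. It is
not how lxml.html.clean or html5lib work internally; the tag sets and the
names BLOCKED_TAGS, ALLOWED_TAGS, blacklist_clean and whitelist_clean are made
up for illustration, and a toy like this is not a real sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policies, chosen only for this illustration.
BLOCKED_TAGS = {"script", "iframe"}              # blacklist: drop these, keep the rest
ALLOWED_TAGS = {"b", "i", "em", "strong", "p"}   # whitelist: keep these, drop the rest


class _TagFilter(HTMLParser):
    """Re-emit markup, keeping only the tags that keep_tag() approves."""

    def __init__(self, keep_tag, keep_attrs):
        super().__init__(convert_charrefs=True)
        self.keep_tag = keep_tag
        self.keep_attrs = keep_attrs
        self.out = []

    def handle_starttag(self, tag, attrs):
        if not self.keep_tag(tag):
            return
        rendered = ""
        if self.keep_attrs:
            # Attributes are passed through as-is (only re-quoted).
            rendered = "".join(
                ' %s="%s"' % (name, escape(value or "", quote=True))
                for name, value in attrs
            )
        self.out.append("<%s%s>" % (tag, rendered))

    def handle_endtag(self, tag):
        if self.keep_tag(tag):
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text content is always escaped on the way out.
        self.out.append(escape(data))


def _run(html, keep_tag, keep_attrs):
    parser = _TagFilter(keep_tag, keep_attrs)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


def blacklist_clean(html):
    """Drop only the known-bad tags; everything else passes, attributes included."""
    return _run(html, lambda tag: tag not in BLOCKED_TAGS, keep_attrs=True)


def whitelist_clean(html):
    """Keep only explicitly allowed tags and strip all attributes."""
    return _run(html, lambda tag: tag in ALLOWED_TAGS, keep_attrs=False)


if __name__ == "__main__":
    payload = "<b>hi</b><img src=x onerror=alert(1)>"
    print(blacklist_clean(payload))
    print(whitelist_clean(payload))
```

With the blacklist, the img tag and its onerror handler survive simply because
nobody thought to list them. With the whitelist, anything not explicitly
allowed is dropped, so a forgotten case fails closed instead of open.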