Comment #1 on issue 183 by akvadr...@gmail.com: HTMLSanitizer can't be used as a tokenizer
http://code.google.com/p/html5lib/issues/detail?id=183

Here is a workaround with a slightly safer design. There is no need to inherit from the mixin or to hardcode the __init__ arguments:


from html5lib import HTMLParser
from html5lib.tokenizer import HTMLTokenizer
from html5lib.sanitizer import HTMLSanitizerMixin

class Sanitizer(HTMLTokenizer):
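    def __init__(self, *a, **kw):
        # Pass all arguments straight through to the tokenizer.
        HTMLTokenizer.__init__(self, *a, **kw)
        # Hold the sanitizer as a helper object instead of inheriting from it.
        self._saner = HTMLSanitizerMixin()

    def __iter__(self):
        for token in HTMLTokenizer.__iter__(self):
            # sanitize_token() returns a false value for tokens to drop.
            saner = self._saner.sanitize_token(token)
            if saner:
                yield saner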

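# Shared parser that reads its tokens through the sanitizing tokenizer.
PARSER = HTMLParser(tokenizer=Sanitizer)

def sanitize(html):
    # Parse an HTML fragment and return the sanitized result as XML.
    return PARSER.parseFragment(html).toxml()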
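
For illustration only, here is the same pattern in a self-contained form that runs without html5lib. It is a rough sketch: ToyTokenizer, ToyFilter, FilteringTokenizer and words() are made-up stand-ins, not html5lib names, and the "banned word" rule is just an assumption to have something to filter.

```python
# Stand-in classes only: none of these names exist in html5lib.

class ToyTokenizer:
    """Yields one token dict per whitespace-separated word."""

    def __init__(self, stream, lowercase=False):
        self.stream = stream
        self.lowercase = lowercase

    def __iter__(self):
        for word in self.stream.split():
            yield {"data": word.lower() if self.lowercase else word}


class ToyFilter:
    """Plays the role of the sanitizer mixin: returns a token or None."""

    banned = frozenset(["script"])

    def sanitize_token(self, token):
        if token["data"].lower() in self.banned:
            return None
        return token


class FilteringTokenizer(ToyTokenizer):
    def __init__(self, *args, **kwargs):
        # Forward everything untouched, so a new base-class argument
        # needs no change in this subclass.
        super().__init__(*args, **kwargs)
        # Composition: hold the filter instead of inheriting from it.
        self._filter = ToyFilter()

    def __iter__(self):
        for token in super().__iter__():
            cleaned = self._filter.sanitize_token(token)
            if cleaned:
                yield cleaned


def words(text, **kwargs):
    # Collect the surviving token data for easy inspection.
    return [t["data"] for t in FilteringTokenizer(text, **kwargs)]
```

Because the subclass only forwards *args and **kwargs, an option such as the hypothetical lowercase flag reaches the base tokenizer without the subclass having to know about it.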

