Comment #1 on issue 183 by akvadr...@gmail.com: HTMLSanitizer can't be used as a tokenizer
http://code.google.com/p/html5lib/issues/detail?id=183
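This is a workaround and a slightly safer design. There is no need to inherit from the mixin or to hardcode the __init__ arguments; the sanitizer is held as a helper object and every argument is passed through to the tokenizer: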

    from html5lib import HTMLParser
    from html5lib.tokenizer import HTMLTokenizer
    from html5lib.sanitizer import HTMLSanitizerMixin


    class Sanitizer(HTMLTokenizer):
        def __init__(self, *a, **kw):
            # Pass every argument straight through to the tokenizer.
            HTMLTokenizer.__init__(self, *a, **kw)
            # Hold the sanitizer as a helper object instead of inheriting from it.
            self._saner = HTMLSanitizerMixin()

        def __iter__(self):
            for token in HTMLTokenizer.__iter__(self):
                saner = self._saner.sanitize_token(token)
                # Skip tokens for which the sanitizer returns nothing.
                if saner:
                    yield saner


    PARSER = HTMLParser(tokenizer=Sanitizer)


    def sanitize(html):
        return PARSER.parseFragment(html).toxml()
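
For anyone who wants to try the wrap-and-filter idea without html5lib installed, here is a minimal sketch of the same pattern. ToyTokenizer and ToySanitizer are hypothetical stand-ins written for illustration, not html5lib's API, and the toy sanitizer simply drops disallowed tags, which may differ from what the real sanitizer does with them. Only the structure (subclass the tokenizer, keep the sanitizer as a helper, filter in __iter__) mirrors the workaround above.

```python
import re


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer: yields dict tokens."""

    # Matches either a tag (optionally closing) or a run of text.
    TOKEN = re.compile(r"<(/?)(\w+)[^>]*>|([^<]+)")

    def __init__(self, text, **kwargs):
        self.text = text

    def __iter__(self):
        for closing, name, chars in self.TOKEN.findall(self.text):
            if name:
                kind = "EndTag" if closing else "StartTag"
                yield {"type": kind, "name": name.lower()}
            else:
                yield {"type": "Characters", "data": chars}


class ToySanitizer:
    """Hypothetical stand-in for a sanitizer with a sanitize_token method."""

    ALLOWED = frozenset({"b", "i", "p"})

    def sanitize_token(self, token):
        # Assumption for this sketch: disallowed tags are dropped (None).
        is_tag = token["type"] in ("StartTag", "EndTag")
        if is_tag and token["name"] not in self.ALLOWED:
            return None
        return token


class SanitizingTokenizer(ToyTokenizer):
    """Tokenizer subclass that filters its own tokens through a helper."""

    def __init__(self, *args, **kwargs):
        # No hardcoded arguments: everything goes to the base tokenizer.
        super().__init__(*args, **kwargs)
        self._saner = ToySanitizer()

    def __iter__(self):
        for token in super().__iter__():
            saner = self._saner.sanitize_token(token)
            if saner:
                yield saner


if __name__ == "__main__":
    for tok in SanitizingTokenizer("<b>hi</b><script>x</script>"):
        print(tok)
```

Because the sanitizer is an attribute rather than a base class, it can be swapped for another object with a sanitize_token method without touching the tokenizer's constructor signature.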