>I would like something that breaks down an html document to a
>datastructure, preferable one with tools like searching.

Well, yes, an HTML tokenizer would be useful. HTML5 has a very readable
specification.

>Parser.HTML is created to allow RXML (or similar) parsing with as
>little computron usage as possible. What I am using it for mostly is
>breaking down random HTML documents for data gathering, which isn't
>the intended use...

Ironically enough, it is about 10 times slower than ye olde Opera HTML5
parser at actually parsing HTML. :)

It is faster to have a simple tokenizer that outputs tokens, which are
then handled by either a tree generator (as also specified in HTML5) or
something that just calls callbacks for tags (like the current
Parser.HTML).

I have seriously considered writing one. But the name 'Parser.HTML' is
already taken. :)

Handy reference:
https://html.spec.whatwg.org/multipage/syntax.html#tokenization

-- Per Hedbor
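
To illustrate the tokenizer/consumer split described above, here is a rough
sketch in Python (not Pike, and not an HTML5-conforming tokenizer). All
names in it (tokenize, run_callbacks, build_tree, Node) are made up for
the example and are not part of Parser.HTML or any existing module. It
assumes well-behaved input: no comments, no script/style handling, no
character references, and no '>' inside attribute values.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical token shape: (kind, name_or_text, attrs),
# where kind is "start", "end" or "text".
Token = Tuple[str, str, Dict[str, str]]

# Deliberately naive patterns; the real HTML5 tokenizer is a state machine.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")
ATTR_RE = re.compile(
    r'([^\s=/]+)(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+)))?'
)
VOID = {"br", "hr", "img", "input", "link", "meta"}


def tokenize(html: str) -> Iterator[Token]:
    """Yield a flat stream of tokens; knows nothing about nesting."""
    pos = 0
    for m in TAG_RE.finditer(html):
        if m.start() > pos:
            yield ("text", html[pos:m.start()], {})
        closing, name, rest = m.groups()
        if closing:
            yield ("end", name.lower(), {})
        else:
            attrs: Dict[str, str] = {}
            for a in ATTR_RE.finditer(rest):
                value = a.group(2) or a.group(3) or a.group(4) or ""
                attrs.setdefault(a.group(1).lower(), value)
            yield ("start", name.lower(), attrs)
        pos = m.end()
    if pos < len(html):
        yield ("text", html[pos:], {})


def run_callbacks(tokens: Iterable[Token],
                  handlers: Dict[str, Callable[[Dict[str, str]], None]]) -> None:
    """Consumer 1: call a handler for each start tag it is registered for."""
    for kind, name, attrs in tokens:
        if kind == "start" and name in handlers:
            handlers[name](attrs)


class Node:
    """Minimal tree node with a recursive search helper."""

    def __init__(self, name: str, attrs: Dict[str, str]) -> None:
        self.name, self.attrs, self.children = name, attrs, []

    def find_all(self, name: str) -> Iterator["Node"]:
        for child in self.children:
            if isinstance(child, Node):
                if child.name == name:
                    yield child
                yield from child.find_all(name)


def build_tree(tokens: Iterable[Token]) -> Node:
    """Consumer 2: build a tree from the same token stream."""
    root = Node("#root", {})
    stack = [root]
    for kind, name, attrs in tokens:
        if kind == "text":
            stack[-1].children.append(name)
        elif kind == "start":
            node = Node(name, attrs)
            stack[-1].children.append(node)
            if name not in VOID:
                stack.append(node)
        else:
            # Pop back to the nearest matching open element, if there is one.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].name == name:
                    del stack[i:]
                    break
    return root
```

The point is only that tokenize() is the single piece that touches the raw
text; the callback dispatcher and the tree builder are both thin loops over
the same token stream, so either can be swapped in depending on whether you
want data gathering with searching or plain tag callbacks.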