>Well, yes, a HTML tokenizer would be useful. HTML5 has a very readable
>specification.
So maybe it's time for a new tool.

>Ironically enough it is about 10 times slower than ye olde Opera HTML5
>parser at actually parsing html. :)

Yes, but I believe it was written to search for specific tags, not to parse every single tag or even to build a data structure around it. So it's naturally pretty bad at anything that isn't RXML (as RXML was at the time, too, probably) :)

>I have seriously considered writing one. But the name 'Parser.HTML' is
>already taken. :)

Which is bad. But it shouldn't be the largest obstacle. :) Use a subtree. Parser.HTML.Tokenizer?