testsuite

Per Hedbor () @ Pike (-) developers forum Wed, 17 Sep 2014 02:40:44 -0700

>I would like something that breaks down an html document to a
>datastructure, preferable one with tools like searching.


Well, yes, an HTML tokenizer would be useful. HTML5 has a very readable
specification.

>Parser.HTML is created to allow RXML (or similar) parsing with as
>little computron usage as possible. What I am using it for mostly is
>breaking down random HTML documents for data gathering, which isn't
>the intended use...

Ironically enough, it is about 10 times slower than ye olde Opera HTML5
parser at actually parsing HTML. :)

It is faster to have a simple tokenizer that outputs tokens, which
are then handled by either a tree generator (as also specified in
HTML5) or something that just calls callbacks for tags (like the
current Parser.HTML).
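
To make the split concrete, here is a rough sketch of that layering. It
is written in Python rather than Pike purely for illustration, and the
names tokenize, run_callbacks and build_tree are hypothetical, not part
of Parser.HTML or any existing module. The regex tokenizer is a toy: it
ignores comments, doctype, void elements, entities and the error
recovery that the HTML5 tokenization rules specify.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, Tuple

# A token is (kind, value, attrs); kind is "start", "end" or "text".
Token = Tuple[str, str, Dict[str, str]]

# Toy patterns (assumption: well-formed input, no '>' inside attribute values).
_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")
_ATTR = re.compile(
    r"([^\s=/]+)(?:\s*=\s*(?:\"([^\"]*)\"|'([^']*)'|([^\s\"'>]+)))?"
)


def tokenize(html: str) -> Iterator[Token]:
    """Layer 1: turn a string into a flat stream of tokens."""
    pos = 0
    for m in _TAG.finditer(html):
        if m.start() > pos:
            yield ("text", html[pos:m.start()], {})
        closing, name, rest = m.groups()
        if closing:
            yield ("end", name.lower(), {})
        else:
            attrs = {
                a.group(1).lower(): a.group(2) or a.group(3) or a.group(4) or ""
                for a in _ATTR.finditer(rest)
            }
            yield ("start", name.lower(), attrs)
        pos = m.end()
    if pos < len(html):
        yield ("text", html[pos:], {})


def run_callbacks(
    tokens: Iterable[Token],
    on_start: Dict[str, Callable[[Dict[str, str]], None]],
) -> None:
    """Layer 2a: call a per-tag callback for each start tag."""
    for kind, value, attrs in tokens:
        if kind == "start" and value in on_start:
            on_start[value](attrs)


def build_tree(tokens: Iterable[Token]) -> dict:
    """Layer 2b: build a nested dict tree from the same token stream."""
    root = {"tag": "#root", "attrs": {}, "children": []}
    stack = [root]
    for kind, value, attrs in tokens:
        if kind == "text":
            stack[-1]["children"].append(value)
        elif kind == "start":
            node = {"tag": value, "attrs": attrs, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            # End tag: close back to the nearest matching open element;
            # an unmatched end tag is ignored. The root is never popped.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["tag"] == value:
                    del stack[i:]
                    break
    return root


if __name__ == "__main__":
    doc = '<p>Hi <a href="x.html">link</a></p>'
    links = []
    run_callbacks(tokenize(doc), {"a": lambda attrs: links.append(attrs.get("href"))})
    print(links)
    print(build_tree(tokenize(doc)))
```

Both consumers read the same token stream, so the tokenizer only has
to be written (and made fast) once; the data-gathering case uses the
tree, the RXML-style case uses the callbacks.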

I have seriously considered writing one. But the name 'Parser.HTML' is
already taken. :)

Handy reference: 

https://html.spec.whatwg.org/multipage/syntax.html#tokenization

-- 
Per Hedbor
