>I would like something that breaks down an html document to a
>datastructure, preferable one with tools like searching.

Well, yes, an HTML tokenizer would be useful. HTML5 has a very readable
specification.

>Parser.HTML is created to allow RXML (or similar) parsing with as
>little computron usage as possible. What I am using it for mostly is
>breaking down random HTML documents for data gathering, which isn't
>the intended use...

Ironically enough, Parser.HTML is about 10 times slower than ye olde
Opera HTML5 parser at actually parsing HTML. :)

It is faster to have a simple tokenizer that outputs tokens, which are
then handled by either a tree generator (as also specified in HTML5)
or something that just calls callbacks for tags (like the current
Parser.HTML).
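
A rough sketch of that split, in Python rather than Pike just to keep it
short. All the names here (tokenize, build_tree, find_all, run_callbacks)
are made up for illustration, and the regexp-based tokenizer is a toy: it
ignores most of the HTML5 tokenization states (script data, character
references, error recovery and so on). The point is only that one token
stream can feed either consumer.

```python
import re

# Toy token pattern (an assumption, not the HTML5 state machine):
# comment, end tag, start tag, or a run of text / a stray '<'.
TOKEN_RE = re.compile(
    r"<!--(?P<comment>.*?)-->"
    r"|</(?P<end>[a-zA-Z][^\s>]*)\s*>"
    r"|<(?P<start>[a-zA-Z][^\s/>]*)(?P<attrs>[^>]*)>"
    r"|(?P<text>[^<]+|<)",
    re.DOTALL,
)
# name, then optionally = "double" | 'single' | bare value
ATTR_RE = re.compile(
    r"""([^\s=/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+)))?"""
)
# Partial list of void elements, enough for the example.
VOID = {"br", "img", "hr", "meta", "link", "input"}


def tokenize(html):
    """Yield (kind, value, attrs) tuples. Knows nothing about consumers."""
    for m in TOKEN_RE.finditer(html):
        if m.group("comment") is not None:
            yield ("comment", m.group("comment"), None)
        elif m.group("end"):
            yield ("end", m.group("end").lower(), None)
        elif m.group("start"):
            attrs = {
                name.lower(): dq or sq or bare
                for name, dq, sq, bare in ATTR_RE.findall(m.group("attrs"))
            }
            yield ("start", m.group("start").lower(), attrs)
        else:
            yield ("text", m.group("text"), None)


def build_tree(tokens):
    """Consumer 1: build a nested dict tree from the token stream."""
    root = {"tag": "#root", "attrs": {}, "children": []}
    stack = [root]
    for kind, value, attrs in tokens:
        if kind == "start":
            node = {"tag": value, "attrs": attrs, "children": []}
            stack[-1]["children"].append(node)
            if value not in VOID:
                stack.append(node)
        elif kind == "end":
            # Close back to the nearest matching open element, if any;
            # an unmatched end tag is silently dropped.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i]["tag"] == value:
                    del stack[i:]
                    break
        elif kind == "text":
            stack[-1]["children"].append(value)
    return root


def find_all(node, tag):
    """Depth-first search of the tree for elements named `tag`."""
    for child in node["children"]:
        if isinstance(child, dict):
            if child["tag"] == tag:
                yield child
            yield from find_all(child, tag)


def run_callbacks(tokens, on_start=None, on_end=None, on_text=None):
    """Consumer 2: no tree, just call a handler per token."""
    handlers = {"start": on_start, "end": on_end, "text": on_text}
    for kind, value, attrs in tokens:
        handler = handlers.get(kind)
        if handler:
            handler(value, attrs)


# Same tokenizer, two consumers.
tree = build_tree(tokenize('<p class="x">Hello <b>world</b><br>bye</p>'))
bold = list(find_all(tree, "b"))

hrefs = []
run_callbacks(
    tokenize('<a href="/a">A</a><a href=/b>B</a>'),
    on_start=lambda tag, attrs: hrefs.append(attrs.get("href")),
)
```

The tree consumer is what you would want for data gathering (searching,
walking), while the callback consumer is closer to how Parser.HTML is
used today.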

I have seriously considered writing one. But the name 'Parser.HTML' is
already taken. :)

Handy reference: 

https://html.spec.whatwg.org/multipage/syntax.html#tokenization

-- 
Per Hedbor