Re: libhubbub parse error on google homepage

Dean Mao Fri, 26 Oct 2012 02:45:16 -0700

Btw, thanks for the help on this.  The project is for a nodejs native
extension that brings the love of libhubbub to the nodejs world:


https://github.com/deanmao/node-hubbub

There are other HTML parsers in the nodejs world, but none are as good as
libhubbub.  I considered using the parser from WebKit or Firefox, but
libhubbub was definitely the easiest to use since it is completely
standalone and uses very few external libraries.


On Wed, Oct 24, 2012 at 4:14 AM, Dean Mao <[email protected]> wrote:

> I see, thanks for the tip.  I'm only using it for the tokeniser as I don't
> have use for a dom tree.  All I did was perform this when I saw a script
> tag:
>
>   hubbub_tokeniser_optparams params;
>   params.content_model.model = HUBBUB_CONTENT_MODEL_CDATA;
>   hubbub_tokeniser_setopt(tok_, HUBBUB_TOKENISER_CONTENT_MODEL, &params);
>
> Then revert it back when I see the end of the script tag.  It seemed like
> that was what in_head.c was doing with parse_generic_rcdata().
>
>
> On Wed, Oct 24, 2012 at 3:27 AM, John-Mark Bell <[email protected]> wrote:
>
>> On Wed, Oct 24, 2012 at 02:54:49AM -0700, Dean Mao wrote:
>> > Here's a more compact test:
>> >
>> > <script>for(var i=0;i<n;i++);</script>
>> >
>> > Outputs:
>> >
>> > START TAG: 'script'
>> > CHARACTERS: 'for(var i=0;i'
>> > START TAG: 'n;i++);<' attributes:
>> > 'script' = ''
>> >
>> > Essentially everything inside a <script> tag should be treated as
>> > characters until a </script> tag is seen.
>>
>> Yes. This behaviour you're seeing is expected. The HTML5 tokeniser has a
>> number of modes, which are selected by the token handler callback
>> provided by the client. The trivial token handler in test/tokeniser.c
>> does not manipulate the tokeniser mode, thus it does not handle the
>> contents of script (and other, similar) elements in the expected fashion.
>>
>> The treebuilder implementation in Hubbub does manipulate the tokeniser
>> mode in the correct way. In most cases, you'll want to use the built-in
>> treebuilder, as it handles all the complexity of coping with junk input
>> for you. See examples/libxml.c for a demonstration of how to use the
>> built-in treebuilder.
>>
>> If you do only wish to use the tokeniser, then you need to ensure that
>> your token handler changes the tokeniser mode in the same way that an
>> HTML5 treebuilder would.
>>
>>
>> J.
>>
>
>

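The mode switching discussed in this thread can be illustrated without
libhubbub at all.  What follows is a minimal toy sketch in C, not hubbub's
tokeniser and not its API: toy_tokenise(), handle_tag() and emit() are
hypothetical names made up for this example, tag names are matched
case-sensitively, and attributes are ignored.  It only shows why the token
handler has to put the tokeniser into a CDATA-like mode on <script> and
revert it on </script>.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy content models, loosely mirroring the PCDATA/CDATA idea in the thread. */
typedef enum { MODEL_PCDATA, MODEL_CDATA } content_model;

/* Append one token line to the NUL-terminated log in out (capacity cap). */
static void emit(char *out, size_t cap, const char *kind,
		const char *s, size_t len)
{
	size_t used = strlen(out);
	snprintf(out + used, cap - used, "%s: '%.*s'\n", kind, (int) len, s);
}

/* Hypothetical token handler: switch modes around script elements. */
static void handle_tag(content_model *model, int is_end,
		const char *name, size_t len)
{
	if (len == 6 && strncmp(name, "script", 6) == 0)
		*model = is_end ? MODEL_PCDATA : MODEL_CDATA;
}

/* Tokenise input into out. If switch_modes is 0 the handler never
 * touches the content model, like a trivial token handler. */
static void toy_tokenise(const char *input, int switch_modes,
		char *out, size_t cap)
{
	content_model model = MODEL_PCDATA;
	const char *p = input;

	assert(input != NULL && out != NULL && cap > 0);
	out[0] = '\0';

	while (*p != '\0') {
		if (model == MODEL_CDATA && strncmp(p, "</script", 8) != 0) {
			/* In CDATA mode everything up to the closing
			 * script tag is character data, even '<'. */
			const char *end = strstr(p, "</script");
			size_t len = end ? (size_t) (end - p) : strlen(p);
			emit(out, cap, "CHARACTERS", p, len);
			p += len;
		} else if (*p == '<') {
			int is_end = (p[1] == '/');
			const char *name = p + 1 + is_end;
			const char *close = strchr(name, '>');
			size_t len;

			if (close == NULL) {
				/* Unterminated tag: treat the rest as text. */
				emit(out, cap, "CHARACTERS", p, strlen(p));
				break;
			}
			/* The tag name ends at a space, '/' or '>'. */
			len = strcspn(name, " />");
			emit(out, cap, is_end ? "END TAG" : "START TAG",
					name, len);
			if (switch_modes)
				handle_tag(&model, is_end, name, len);
			p = close + 1;
		} else {
			size_t len = strcspn(p, "<");
			emit(out, cap, "CHARACTERS", p, len);
			p += len;
		}
	}
}
```

Calling toy_tokenise() on the compact test from this thread with
switch_modes set to 1 logs a START TAG for 'script', a single CHARACTERS
token holding 'for(var i=0;i<n;i++);', and an END TAG for 'script'.  With
switch_modes set to 0, the '<' inside the loop condition is taken as the
start of a tag named 'n;i++);<', which is the kind of breakage reported
above.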