1- I am currently building an HTML5 parser according to the WHATWG specification, and while testing the tokenizer against the html5lib tests I have noticed that some of them appear to have bugs. These are the major things I have noticed. I am referring to this set of tests: http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/test1.test
There are similar cases in test2, test3 and test4, but let's just stick with test1 for now.

I have noticed that in places where you have DOCTYPE tokens like this one:

{"description":"Correct Doctype lowercase", "input":"<!DOCTYPE html>", "output":[["DOCTYPE", "html", null, null, true]]}

the force-quirks flag is set to true, whereas the specification describes it as either on or off, as in the EOF case here: http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#doctype-state

2- Where character tokens follow one another, the expected output gives one character token with all the characters as its data, when it should give one character token for every single character. Here is an example:

{"description":"Ampersand ampersand EOF", "input":"&&", "output":[["Character", "&&"]]}

My expected output for this is two character tokens, each with a single ampersand as data, rather than just one token.

3- Assuming true stands for on and false for off, many force-quirks flags are inverted: true (on) is given where it has to be false (off). The earlier case I gave is an example. The states that should be covered with this input are the following:

Data state: <!DOCTYPE html>
Tag open state: <!DOCTYPE html>
Markup declaration open state: <!DOCTYPE html>
DOCTYPE state: <!DOCTYPE html>
Before DOCTYPE name state: <!DOCTYPE html>
DOCTYPE name state: <!DOCTYPE html>
DOCTYPE name state: <!DOCTYPE html>
DOCTYPE name state: <!DOCTYPE html>
DOCTYPE name state: <!DOCTYPE html>

The DOCTYPE name state says the following:

U+003E GREATER-THAN SIGN (>)
Switch to the data state <http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state>. Emit the current DOCTYPE token.
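To make the state walk above concrete, here is a minimal sketch in Python of how I read those states. It only models input of the form "<!DOCTYPE name>" (no public or system identifiers), and the names DoctypeToken and tokenize_simple_doctype are made up for illustration; they are not part of html5lib or the specification.

```python
from dataclasses import dataclass
from typing import Optional

WHITESPACE = "\t\n\f "


@dataclass
class DoctypeToken:
    # Hypothetical token type; "missing" is modelled as None.
    name: Optional[str] = None
    public_id: Optional[str] = None
    system_id: Optional[str] = None
    force_quirks: bool = False  # off when the token is created


def tokenize_simple_doctype(text: str) -> DoctypeToken:
    """Walk '<!DOCTYPE name>' only; after-name states are not modelled."""
    prefix = "<!doctype"
    # Data state -> tag open state -> markup declaration open state.
    if text[:len(prefix)].lower() != prefix:
        raise ValueError("input does not start with a DOCTYPE")
    pos = len(prefix)
    token = DoctypeToken()

    # DOCTYPE state and before DOCTYPE name state: skip whitespace.
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1

    # DOCTYPE name state: collect characters until '>' or EOF.
    name = []
    while pos < len(text) and text[pos] != ">":
        ch = text[pos]
        if ch in WHITESPACE:
            raise ValueError("after DOCTYPE name states are not modelled here")
        # Only ASCII uppercase letters are lowercased.
        name.append(ch.lower() if "A" <= ch <= "Z" else ch)
        pos += 1

    if name:
        token.name = "".join(name)
    if pos == len(text) or not name:
        # EOF inside the DOCTYPE, or '>' before any name: force-quirks on.
        token.force_quirks = True
    return token
```

With this sketch, tokenize_simple_doctype("<!DOCTYPE html>") leaves force_quirks at False, which is the behaviour I expect from the states listed above, while an input that hits EOF before the closing ">" turns it on.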
Then in the data state the EOF is read, so nothing there touches the force-quirks flag, and the specification says the following:

"When a DOCTYPE token is created, its name, public identifier, and system identifier must be marked as missing (which is a distinct state from the empty string), and the *force-quirks flag* must be set to *off* (its other state is *on*)."

So by default it has to be off (false).

Now there is one thing I am not certain about: whether this output is the output after parsing happens. I am testing the tokenizer without any of the tree construction stages, and this might be the problem.

If I am wrong in any of these places, please correct me so that I can see where I am going wrong.
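One possible explanation for points 2 and 3, which I have not confirmed: as far as I understand the README that accompanies the tokenizer tests, the expected output coalesces adjacent character tokens into a single ["Character", data] entry, and the last DOCTYPE field is a "correctness" value that is true when the force-quirks flag is off. If that reading is right, the tests are consistent with the specification and a small adapter is needed before comparing. A sketch in Python, where to_test_format and the tuple layout of my own tokens are assumptions for illustration:

```python
def to_test_format(tokens):
    """Convert spec-style tokens into the shape the test files seem to expect.

    Assumed input layout (hypothetical, from my own tokenizer):
      ("Character", data)
      ("DOCTYPE", name, public_id, system_id, force_quirks)
    Any other token is passed through unchanged.
    """
    out = []
    for tok in tokens:
        kind = tok[0]
        if kind == "Character":
            if out and out[-1][0] == "Character":
                # Merge with the previous character token.
                out[-1][1] += tok[1]
            else:
                out.append(["Character", tok[1]])
        elif kind == "DOCTYPE":
            _, name, public_id, system_id, force_quirks = tok
            # Assumption: the test stores "correctness", the inverse of force-quirks.
            out.append(["DOCTYPE", name, public_id, system_id, not force_quirks])
        else:
            out.append(list(tok))
    return out
```

For example, two ("Character", "&") tokens would become one ["Character", "&&"] entry, and a DOCTYPE token with force-quirks off would end with true, matching both test cases quoted above.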