Tokenizer Tests Errors

Mohammad Houssami Sat, 20 Jul 2013 05:01:15 -0700

 

1-      I am currently building an HTML5 parser according to the WHATWG 
specification, and while testing the tokenizer against the html5lib tests 
I have noticed that some of them appear to have bugs. These are the major 
things I have noticed.
I am referring to this set of tests: 
http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/test1.test


There are similar issues in test2, test3 and test4, but let's stick with 
test1 for now.

I have noticed a problem in tests that expect DOCTYPE tokens, such as this one:
  
{"description":"Correct Doctype lowercase",
"input":"<!DOCTYPE html>",
"output":[["DOCTYPE", "html", null, null, true]]}
  
The force-quirks flag is set to true here, whereas the specification 
describes the flag as being either on or off.

See, for example, the EOF case in the DOCTYPE state: 
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#doctype-state

 

2-      Where several character tokens follow one another, the tests give a 
single character token with all the characters as its data, when they should 
give one character token for every single character. Here is an example:
  
{"description":"Ampersand ampersand EOF",
"input":"&&",
"output":[["Character", "&&"]]}
  
My expected output for this is two character tokens, each with a single 
ampersand as its data, rather than just one token.
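
One way to compare the two forms is to merge runs of adjacent character 
tokens before checking against the expected output. The following is only a 
minimal sketch: it assumes tokens are represented as lists in the same shape 
as the test file, and the name coalesce_characters is made up for this 
illustration and is not part of html5lib.

```python
def coalesce_characters(tokens):
    """Merge runs of adjacent ["Character", data] tokens into one token."""
    merged = []
    for token in tokens:
        if token[0] == "Character" and merged and merged[-1][0] == "Character":
            # Append this token's data to the previous character token.
            merged[-1] = ["Character", merged[-1][1] + token[1]]
        else:
            # Copy the token so the caller's list is not modified.
            merged.append(list(token))
    return merged


# Two single-character tokens, as a per-character tokenizer would emit for "&&".
emitted = [["Character", "&"], ["Character", "&"]]
expected = [["Character", "&&"]]
print(coalesce_characters(emitted) == expected)
```

With this step in place, a tokenizer that emits one token per character and 
one that emits a single combined token produce the same list for comparison.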

 
  
3-      Assuming true stands for on and false for off, many force-quirks 
flags are inverted: true (on) is given where it should be false (off). The 
DOCTYPE case I gave earlier is an example.

The states that should be covered with this input are the following:

Data state: <!DOCTYPE html>
Tag open state: <!DOCTYPE html>
Markup declaration open state: <!DOCTYPE html>
DOCTYPE state: <!DOCTYPE html>
Before DOCTYPE name state: <!DOCTYPE html>
DOCTYPE name state: <!DOCTYPE html>
DOCTYPE name state: <!DOCTYPE html>
DOCTYPE name state: <!DOCTYPE html>
DOCTYPE name state: <!DOCTYPE html>

The DOCTYPE name state says the following for U+003E GREATER-THAN SIGN (>):

Switch to the data state 
<http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state>. 
Emit the current DOCTYPE token.

Then, in the data state, EOF is read, so nothing along the way touches the 
force-quirks flag, and the specification says the following: "When a 
DOCTYPE token is created, its name, public identifier, and system 
identifier must be marked as missing (which is a distinct state from the 
empty string), and the *force-quirks flag* must be set to *off* (its other 
state is *on*)." So by default it has to be off (false).
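
The walk through the states can be sketched in code. This is only an 
illustration for an input shaped like <!DOCTYPE html>, not a general 
tokenizer: the function name tokenize_doctype, the dictionary keys and the 
state labels are assumptions made for this example, and every other branch of 
the specification is left out.

```python
def tokenize_doctype(text):
    """Walk the states for an input shaped like '<!DOCTYPE name>'.

    Returns (token, visited_states). Anything outside this one input
    shape raises ValueError; this is not a general tokenizer.
    """
    visited = []
    pos = 0

    def consume(state, expected):
        # Record the state, then consume `expected` (case-insensitively).
        nonlocal pos
        visited.append(state)
        chunk = text[pos:pos + len(expected)]
        if chunk.upper() != expected.upper():
            raise ValueError(f"{state}: expected {expected!r}, got {chunk!r}")
        pos += len(expected)

    consume("data", "<")
    consume("tag open", "!")
    consume("markup declaration open", "DOCTYPE")
    consume("DOCTYPE", " ")

    # Before DOCTYPE name state: the token is created here with the
    # force-quirks flag off (False); the public and system identifiers
    # stay missing (None) because nothing in this input sets them.
    visited.append("before DOCTYPE name")
    if pos >= len(text) or not text[pos].isalpha():
        raise ValueError("before DOCTYPE name: expected a letter")
    token = {"name": text[pos].lower(), "public_id": None,
             "system_id": None, "force_quirks": False}
    pos += 1

    # DOCTYPE name state: append characters until ">" emits the token.
    while pos < len(text):
        visited.append("DOCTYPE name")
        ch = text[pos]
        pos += 1
        if ch == ">":
            visited.append("data")  # switch back to the data state
            return token, visited
        token["name"] += ch.lower()
    raise ValueError("EOF inside the DOCTYPE name is not handled here")


token, visited = tokenize_doctype("<!DOCTYPE html>")
print(visited)
print(token)
```

None of the transitions taken for this input sets the flag to on, so the 
emitted token still carries the value it was created with.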

There is one thing I am not certain about: whether this output is the 
output after parsing has happened. I am testing the tokenizer without any 
of the tree construction stages, and this might be the problem.

If I am wrong anywhere, please correct me so that I can see where I am 
going wrong.
