UTF-16 without BOM is exiled

Kornél Pál Thu, 07 May 2009 01:00:34 -0700

Hi,

After having a look at the HTML5 editor's draft, I believe that 8.2.2.3 incorrectly instructs changing UTF-16 to UTF-8.

UTF-16 without a BOM cannot be detected using the sniffing algorithm because it is incompatible with ASCII. The browser may guess (step 6) that it's UTF-16, but that guess will only be tentative.
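
To illustrate why a byte-level scan misses the declaration, here is a minimal Python sketch. The helper ascii_prescan_finds_meta is hypothetical: it is only a crude stand-in for the sniffing algorithm (it just searches for the ASCII bytes of "<meta"), not an implementation of the spec's steps, and the sample markup is made up for the example.

```python
# Illustrative declaration; not taken from any real document.
SAMPLE = '<meta charset="utf-16">'


def ascii_prescan_finds_meta(data: bytes) -> bool:
    """Crude stand-in for a byte-level prescan: look for ASCII '<meta'."""
    return b"<meta" in data.lower()


utf8_bytes = SAMPLE.encode("utf-8")
# The "utf-16-le" codec writes little-endian code units and no BOM.
utf16_bytes = SAMPLE.encode("utf-16-le")

print(ascii_prescan_finds_meta(utf8_bytes))   # True
print(ascii_prescan_finds_meta(utf16_bytes))  # False
print(utf16_bytes[:10])                       # b'<\x00m\x00e\x00t\x00a\x00'
```

Every ASCII character is followed by a zero byte in the UTF-16 form, so the ASCII byte sequence never appears and the declaration can only be seen once the bytes are already being decoded as UTF-16.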


After that, the parser may find an encoding specified in the UTF-16 text.

If the encoding found is UTF-16, then step 1 of 8.2.2.3 instructs changing it to UTF-8, which is definitely wrong.

Another problem is that if you were able to find an encoding name other than UTF-16 in valid HTML code decoded as if it were UTF-16, you shouldn't restart parsing, because if the document isn't UTF-16 then the encoding found is not accurate either.


4.2.5.5 also states:

If an HTML document does not start with a BOM, and if its encoding is not explicitly given by Content-Type metadata, then the character encoding used must be an ASCII-compatible character encoding

and

If an HTML document contains a meta element with a charset attribute or a meta element in the Encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding.

Together these are equivalent to saying that UTF-16 without a BOM is not allowed, but I believe that this was not the intent. If it really is, I would prefer to have an explicit note about this.

An encoding found while parsing as UTF-16 should be UTF-16, and any other value should be treated as a parse error.
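
The rule proposed above could be sketched roughly as follows. This is only an illustration of the suggestion in this mail, not text from the draft: the names Decision and resolve_when_decoding_as_utf16 are hypothetical, and accepting the "utf-16le"/"utf-16be" labels as equivalent to "utf-16" is an assumption.

```python
from typing import NamedTuple

# Assumption: these labels all count as "UTF-16" for the comparison.
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}


class Decision(NamedTuple):
    encoding: str      # encoding to keep using
    parse_error: bool  # whether the declaration is flagged as an error
    restart: bool      # whether parsing is restarted with a new encoding


def resolve_when_decoding_as_utf16(declared: str) -> Decision:
    """Handle a declared encoding found while tentatively decoding as UTF-16."""
    label = declared.strip().lower()
    if label in UTF16_LABELS:
        # The declaration confirms the guess; do not rewrite it to UTF-8.
        return Decision("utf-16", parse_error=False, restart=False)
    # The declaration was only readable because the bytes decode as UTF-16,
    # so any other label cannot be trusted: flag it and keep going.
    return Decision("utf-16", parse_error=True, restart=False)


print(resolve_when_decoding_as_utf16("UTF-16"))
print(resolve_when_decoding_as_utf16("windows-1252"))
```

In both branches the parser stays on UTF-16 and never restarts; the only difference is whether the declaration is reported as a parse error.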

Permitting UTF-16 without a BOM makes sense because encoding autodetection is permitted as well, and ASCII-compatible encodings that have the encoding specified using <meta> will not reach the autodetection stage.


Best regards,
Kornél Pál
