On Wed, Sep 28, 2011 at 4:16 AM, Jonas Sicking <[email protected]> wrote:
> So it sounds like your argument is that we should do <meta> prescan
> because we can do it without breaking any new ground. Not because it's
> better or was inherently safer before webkit tried it out.
The outcome I am suggesting is that character encoding determination
for text/html in XHR should be:

 1) HTTP charset
 2) BOM
 3) <meta> prescan
 4) UTF-8

(There is a sketch of this order in code at the end of this message.)

My rationale is:

 * Restarting the parser sucks. Full heuristic detection and
   non-prescan <meta> require restarting.

 * Supporting HTTP charset, BOM and <meta> prescan means supporting
   all the cases where the author is declaring the encoding in a
   conforming way.

 * Supporting <meta> prescan even for responseText is safe to the
   extent content is not already broken in WebKit.

 * Not doing even heuristic detection on the first 1024 bytes allows
   us to avoid one of the unpredictability- and
   non-interoperability-inducing legacy flaws that encumber HTML when
   it is loaded into a browsing context.

 * Using a clamped last-resort encoding instead of a user setting or a
   locale-dependent encoding avoids another of those legacy flaws.

 * Using UTF-8 (as opposed to Windows-1252, a user setting or a
   locale-dependent encoding) as the last-resort encoding allows the
   same encoding to be used in the responseXML and responseText cases
   without breaking existing responseText usage that expects UTF-8.
   (UTF-8 is the responseText default in Gecko.)

What outcome do you suggest, and why? It seems you aren't suggesting
anything that involves a parser restart. Are you just arguing against
UTF-8 as the last resort?

> And in any case, it's easy to figure out where the
> data was loaded from after the fact, so debugging doesn't seem any
> harder.

If that counts as "not harder", I concede this point.
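To make the proposed order concrete, here is a minimal sketch in
TypeScript. This is not Gecko's (or anyone's) actual implementation:
httpCharset and metaPrescan are hypothetical stand-ins for the real
Content-Type parsing and prescan machinery, and only the priority
order is the point.

function sniffBom(bytes: Uint8Array): string | null {
  // BOM sniffing: UTF-8, then UTF-16BE/LE.
  if (bytes.length >= 3 &&
      bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "UTF-8";
  }
  if (bytes.length >= 2) {
    if (bytes[0] === 0xfe && bytes[1] === 0xff) return "UTF-16BE";
    if (bytes[0] === 0xff && bytes[1] === 0xfe) return "UTF-16LE";
  }
  return null;
}

function determineEncoding(
  httpCharset: string | null, // charset from the Content-Type header, if any
  bodyBytes: Uint8Array,      // response bytes buffered so far
  metaPrescan: (bytes: Uint8Array) => string | null // hypothetical helper
): string {
  // 1) An HTTP-level charset declaration wins.
  if (httpCharset !== null) {
    return httpCharset;
  }
  // 2) Otherwise, the byte order mark.
  const fromBom = sniffBom(bodyBytes);
  if (fromBom !== null) {
    return fromBom;
  }
  // 3) Otherwise, a <meta> prescan limited to the first 1024 bytes.
  //    Since no bytes have been decoded yet, no parser restart is needed.
  const fromMeta = metaPrescan(bodyBytes.subarray(0, 1024));
  if (fromMeta !== null) {
    return fromMeta;
  }
  // 4) No heuristic detection and no user- or locale-dependent fallback:
  //    the last resort is clamped to UTF-8.
  return "UTF-8";
}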
--
Henri Sivonen
[email protected]
http://hsivonen.iki.fi/