On Sun, Aug 16, 2009 at 2:41 AM, Anne van Kesteren<[email protected]> wrote:
> On Sun, 16 Aug 2009 01:31:09 +0200, Erik van der Poel <[email protected]>
> wrote:
>> I had another look at section 2.7, and it does have a pointer to the
>> IANA charset registry, which also says "However, no distinction is
>> made between use of upper and lower case letters." This is the only
>> matching rule that we need. UTS22 is too lenient, and we all know what
>> happens to the Web when browsers are too lenient. If the discussion on
>> [email protected] actually yields any more results, we may wish
>> to consider adding them to HTML 5, but for now, I think having HTML 5
>> refer to the IANA charset registry is sufficient.
>
> So I made a few tests to figure out the matching rules and
> case-insensitive does not seem like the only rule we need, though it
> depends a bit on which browser we want to follow. I made a few tests
> and run them through Opera (O), Firefox (F), and Chromium (C) (all on
> Ubuntu):

It would also be interesting to find out what MSIE and Firefox on
Windows do, and what Safari on Mac does.

> http://dump.testsuite.org/2009/encoding-matching/
>
> Ignoring the fact that C treats ISO-8859-9 as Windows-1254 (which the other
> browsers should probably copy) the results are as follows:

I agree that ISO-8859-9 should be treated as its "superset" Windows-1254.

> Ignores leading whitespace: O, F, C

Interesting. If MSIE and Firefox on Windows do this too, it would
probably be a good idea to add this rule.

> Ignores whitespace within label: O
> Ignores leading ): O, C
> Ignores trailing @: O, C
> Allows underscores rather than hyphens for this encoding: O, C
> Ignores @ within label: O, C

If MSIE and Firefox on Windows do not do these, I think we should
consider omitting these rules.

> Now I'm positively certain that EUC-JP should not be recognized as
> EUC_JP and quite certain that C does not recognize it as such so I'm
> guessing ISO_8859_9 is an alias C supports, but documentation on that
> would be good.

Neither MSIE nor Firefox supports EUC_JP, so I don't know what
Chromium is hoping to accomplish by recognizing it. EUC-JP is used
much more often than EUC_JP on the Web. (About 500 times more often.)
In fact, UFT-8 and ISO-8559-1 occur more often than EUC_JP. (Look
carefully -- those are both misspellings.)

Erik
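
To make the narrower rule set discussed above concrete, here is a minimal sketch in Python of a matcher that applies only case-insensitive comparison plus optional stripping of leading whitespace, and nothing else (no underscore/hyphen folding, no dropping of "@" or ")"). The ALIASES table and the match_label name are hypothetical and exist only for illustration; a real implementation would draw its labels from the IANA charset registry, and the ISO-8859-9 entry simply encodes the superset suggestion made in this thread.

```python
# Hypothetical, deliberately tiny alias table (lowercase label -> encoding).
# A real implementation would be populated from the IANA charset registry.
ALIASES = {
    "utf-8": "UTF-8",
    "iso-8859-1": "ISO-8859-1",
    "iso-8859-9": "windows-1254",  # assumption: treat as its superset
    "windows-1254": "windows-1254",
    "euc-jp": "EUC-JP",
}

# ASCII whitespace characters to drop from the front of a label.
LEADING_WHITESPACE = " \t\n\r\f"


def match_label(label, strip_leading_whitespace=True):
    """Return the encoding name for a charset label, or None if unknown.

    Only two rules are applied: optional removal of leading whitespace,
    and case-insensitive comparison. Everything else must match exactly.
    """
    if strip_leading_whitespace:
        label = label.lstrip(LEADING_WHITESPACE)
    return ALIASES.get(label.lower())


if __name__ == "__main__":
    for candidate in ["EUC-JP", "  euc-jp", "EUC_JP", "UFT-8", "ISO-8859-9"]:
        print(repr(candidate), "->", match_label(candidate))
```

Under these rules EUC_JP and the misspelled UFT-8 are both rejected, while a label with leading whitespace is still accepted. Passing strip_leading_whitespace=False gives the IANA-only behaviour (case-insensitivity and nothing more).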
