Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Ian Hickson wrote: Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC. It is not clear what this means (e.g., the character set JIS_C6226-1983 in any encoding, or only when encoded alone according to RFC1345 as described above); This is talking about character encodings, not character sets. JIS_C6226-1983 is a registered character encoding in the IANA registry. Yes, I can understand this, but... On Fri, 23 Oct 2009, NARUSE, Yui wrote: Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC. First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, moreover those correct names as spec are JIS X 0208 and JIS X 0212. On Thu, 22 Oct 2009, Øistein E. Andersen wrote: I am not sure what you mean; they are both listed at http://www.iana.org/assignments/character-sets: Name: JIS_C6226-1983 [RFC1345,KXS2] MIBenum: 63 Source: ECMA registry Alias: iso-ir-87 Alias: x0208 Alias: JIS_X0208-1983 Alias: csISO87JISX0208 Name: JIS_X0212-1990 [RFC1345,KXS2] MIBenum: 98 Source: ECMA registry Alias: x0212 Alias: iso-ir-159 Alias: csISO159JISX02121990 On Fri, 23 Oct 2009, NARUSE, Yui wrote: Where is the word JIS-X-0208 ? Where is the word JIS-X-0212 ? The exact string isn't there, that's why I included the preferred MIME names in brackets in the spec. if it is talking about character encodings, why it uses the name of character sets mainly? Following seems better. Authors should not use JIS_C6226-1983, JIS_X0212-1990, encodings based on ISO-2022, and encodings based on EBCDIC. On Fri, 23 Oct 2009, NARUSE, Yui wrote: Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not ASCII compatible. So they are out of discouraged; mustn't use. You can use non-ASCII-compatible encodings (e.g. UTF-16). I see. -- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Fri, 23 Oct 2009, NARUSE, Yui wrote: The exact string isn't there, that's why I included the preferred MIME names in brackets in the spec. if it is talking about character encodings, why it uses the name of character sets mainly? Following seems better. Authors should not use JIS_C6226-1983, JIS_X0212-1990, encodings based on ISO-2022, and encodings based on EBCDIC. Ok, done. -- Ian Hickson http://ln.hixie.ch/
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On 23 Oct 2009, at 04:20, Ian Hickson wrote: On Wed, 21 Oct 2009, Øistein E. Andersen wrote: ASCII-compatibility: The note in ‘2.1.5 Character encodings’ seems to say that [...] ISO-2022[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot find anything in Section 2.1.5 that would explain this difference. HZ-GB-2312 uses the byte ASCII uses for ~ as the escape character. ISO-2022-* uses the control codes. That's the difference. '~'/0x7E is not (and should not be, as far as I can tell) relevant for HTML5's concept of ASCII compatibility. Discouraged encodings: [...] Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), [...] It is not clear what this means [...] This is talking about character encodings, not character sets. JIS_C6226-1983 is a registered character encoding in the IANA registry. (This is less confusing now since HTML5 only deals with character encodings and the strings match those in the IANA registry as suggested by Yui Naruse.) the list of discouraged encodings seems conspicuously short if it is supposed to be complete; and the lack of rationale makes it difficult to understand why these encodings are considered particularly harmful (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two at least initially puzzling cases). The reason for including these is to discourage encodings known to have security issues. I've added HZ-GB-2312, which can be used in a similarly dangerous fashion. (Basically the danger for user agents is in an attacker using an encoding that a user agent could autodetect, while a site interprets the bytes safely; that would allow those encodings to be used to smuggle script elements in a way that a naive whitelisting filter would think is safe.) It might be better to say *why* particular encodings are better avoided, whether or not the list of discouraged encodings be presented as definitive. I've added a note. [...] 
On Thu, 22 Oct 2009, Philip Taylor wrote: The string [숍訊昱穿] encoded as ISO-2022-KR is the bytes 0e 3c 73 63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome, when I last checked) will decode it as Windows-1252 and get the string <script>, which is bad. So a site that uses ISO-2022-KR is very likely to expose some users to XSS attacks, which seems like a good reason to discourage that encoding. The same applies to other ISO-2022 encodings. [...] On Thu, 22 Oct 2009, Øistein E. Andersen wrote: If that is the reason, at least HZ encoding would seem to be affected as well. Explicitly discouraging a more or less random subset of the problematic encodings without providing rationale makes it difficult to assess whether or not other, somewhat similar, encodings should be avoided as well, which was the main issue I wanted to raise. Hopefully this is somewhat addressed now. The added note certainly helps, but it is vague (does "[m]ost of these encodings" mean all the encodings mentioned above apart from UTF-32?) and inaccurate (Philip Taylor's example does not rely on bugs). Given that the set of encodings is open-ended, I still think it would be preferable to make the rationale (a definition of what makes an encoding problematic) primary and mention actual encodings as examples. This could give something like the following: Encodings in which a series of bytes in the range 0x20..0x7E may encode characters other than the corresponding characters in the range U+20..U+7E represent a potential security vulnerability since a browser that does not support the encoding (or does not support the label used to declare the encoding, or does not use the same mechanism to detect the encoding of unlabelled content) might end up interpreting technically benign plain text content as HTML tags and JavaScript. In particular, this applies to encodings in which the bytes corresponding to 'script' in ASCII may encode a different string. 
Authors should not use such encodings, which are known to include In addition, authors should not use UTF-32 Alternatively, fixing the current note would help and might be sufficient, albeit not ideal. I think one has to realise that a comprehensive list of problematic encodings is an elusive goal and act accordingly. -- Øistein E. Andersen PS: The following sentence makes little sense without (curly) quotes and apostrophes. In case they disappeared before you read it, please find it repeated below with (ASCII) quotes and apostrophes: It should probably be "advise against authors' using legacy encodings" or better "advise authors against using legacy encodings". (The current text in the spec is fine.)
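The criterion proposed in the message above (printable ASCII bytes that decode to something other than the corresponding ASCII characters) can be probed mechanically. The following is a minimal sketch and not part of the thread: the helper name `printable_ascii_is_transparent` is hypothetical, it assumes Python's standard `utf-8`, `cp1252` and `hz` codecs, and it only tests the sample it is given, so a `True` result does not show that an encoding is safe in general.

```python
def printable_ascii_is_transparent(codec: str, sample: bytes) -> bool:
    """Return True if `sample` (bytes in 0x20..0x7E only) decodes under
    `codec` to exactly the ASCII text those bytes spell out."""
    if not all(0x20 <= b <= 0x7E for b in sample):
        raise ValueError("sample must contain only bytes in 0x20..0x7E")
    try:
        decoded = sample.decode(codec)
    except UnicodeDecodeError:
        # Undecodable input is certainly not "the same ASCII text".
        return False
    return decoded == sample.decode("ascii")


# In HZ, "~{" switches to two-byte GB 2312 mode and "~}" switches back,
# so the bytes in between are read as pairs rather than as ASCII.
HZ_SAMPLE = b"~{<script>~}"

if __name__ == "__main__":
    for codec in ("utf-8", "cp1252", "hz"):
        print(codec, printable_ascii_is_transparent(codec, HZ_SAMPLE))
```

A filter that decodes this sample as HZ sees no angle brackets, while a browser that falls back to Windows-1252 or UTF-8 sees the tag, which is the mismatch the proposed wording tries to capture.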
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Fri, 23 Oct 2009, Øistein E. Andersen wrote: On 23 Oct 2009, at 04:20, Ian Hickson wrote: On Wed, 21 Oct 2009, Øistein E. Andersen wrote: ASCII-compatibility: The note in ‘2.1.5 Character encodings’ seems to say that [...] ISO-2022[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot find anything in Section 2.1.5 that would explain this difference. HZ-GB-2312 uses the byte ASCII uses for ~ as the escape character. ISO-2022-* uses the control codes. That's the difference. '~'/0x7E is not (and should not be, as far as I can tell) relevant for HTML5's concept of ASCII compatibility. Good point. Moved the encoding over to the other side. The added note certainly helps, but it is vague (does "[m]ost of these encodings" mean all the encodings mentioned above apart from UTF-32?) and inaccurate (Philip Taylor's example does not rely on bugs). Given that the set of encodings is open-ended, I still think it would be preferable to make the rationale (a definition of what makes an encoding problematic) primary and mention actual encodings as examples. This could give something like the following: Encodings in which a series of bytes in the range 0x20..0x7E may encode characters other than the corresponding characters in the range U+20..U+7E represent a potential security vulnerability since a browser that does not support the encoding (or does not support the label used to declare the encoding, or does not use the same mechanism to detect the encoding of unlabelled content) might end up interpreting technically benign plain text content as HTML tags and JavaScript. In particular, this applies to encodings in which the bytes corresponding to 'script' in ASCII may encode a different string. Authors should not use such encodings, which are known to include In addition, authors should not use UTF-32 Alternatively, fixing the current note would help and might be sufficient, albeit not ideal. I've reworded the spec based on your suggestion. Thanks! 
-- Ian Hickson http://ln.hixie.ch/
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Øistein E. Andersen wrote: Discouraged encodings: ‘4.2.5.5 Specifying the document's character encoding’ advises against certain encodings. (Incidentally, this advice probably deserves not to be ‘hidden’ in a section nominally reserved for character encoding *declaration* issues.) In particular: Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC. First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, moreover those correct names as spec are JIS X 0208 and JIS X 0212. Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not ASCII compatible. So they are out of discouraged; mustn't use. Finally, Why ISO 2022 series is discouraged is not clear. Anyway, most of charsets defined RFC 1345 are not clear. Conversion table between Unicode is needed. -- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On 22 Oct 2009, at 17:15, NARUSE, Yui wrote: First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, I am not sure what you mean; they are both listed at http://www.iana.org/assignments/character-sets: Name: JIS_C6226-1983 [RFC1345,KXS2] MIBenum: 63 Source: ECMA registry Alias: iso-ir-87 Alias: x0208 Alias: JIS_X0208-1983 Alias: csISO87JISX0208 Name: JIS_X0212-1990 [RFC1345,KXS2] MIBenum: 98 Source: ECMA registry Alias: x0212 Alias: iso-ir-159 Alias: csISO159JISX02121990 moreover those correct names as spec are JIS X 0208 and JIS X 0212. (The IANA registry is internally inconsistent and often disagrees with official standards when it comes to capitalisation, dashes/hyphens, underscores and spaces, so it is difficult to get this right. Please excuse me for not always paying due attention to such details in e-mails. Of course, the specifications should follow either IANA or the official standard as appropriate, depending on what it is referring to.) Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not ASCII compatible. So they are out of discouraged; mustn't use. EBCDIC is clearly not ASCII-compatible and may be unique amongst the character sets in the IANA registry in providing the full ASCII repertoire in a different arrangement. JIS_C6226-1983 and JIS_X0212-1990 as defined in RFC1345 (i.e., on their own) do not contain basic ASCII characters at all, so it makes little sense to use them for HTML documents without adding ASCII or the ASCII-based JIS C 6220-1969, which would give something like EUC-JP or ISO-2022-JP. JIS_C6226-1983 contains wide versions of ASCII characters, but those are not interpreted as HTML mark-up (unless I am mistaken). JIS_X0212-1990 does not contain ASCII, kana or basic kanji, so it is of extremely limited usefulness on its own even in a plain-text setting. Warning against completely useless encodings seems pointless. 
Many other encodings in the IANA registry are ASCII-incompatible in different ways; what I do not understand is what makes the ones currently mentioned in the HTML5 draft particularly harmful. Finally, Why ISO 2022 series is discouraged is not clear. We agree on this point. Anyway, most of charsets defined RFC 1345 are not clear. Conversion table between [those charsets and] Unicode is needed. Quite. Anne van Kesteren, I and several others are currently trying to document how browsers handle different encodings at http://wiki.whatwg.org/wiki/Web_Encodings, and defining mappings to Unicode is one of the goals. Your contribution would be much appreciated. -- Øistein E. Andersen
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Øistein E. Andersen wrote: On 22 Oct 2009, at 17:15, NARUSE, Yui wrote: First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, I am not sure what you mean; they are both listed at http://www.iana.org/assignments/character-sets: Name: JIS_C6226-1983 [RFC1345,KXS2] MIBenum: 63 Source: ECMA registry Alias: iso-ir-87 Alias: x0208 Alias: JIS_X0208-1983 Alias: csISO87JISX0208 Where is the word JIS-X-0208 ? Name: JIS_X0212-1990 [RFC1345,KXS2] MIBenum: 98 Source: ECMA registry Alias: x0212 Alias: iso-ir-159 Alias: csISO159JISX02121990 Where is the word JIS-X-0212 ? moreover those correct names as spec are JIS X 0208 and JIS X 0212. Please excuse me for not always paying due attention to such details in e-mails. Of course, the specifications should follow either IANA or the official standard as appropriate, depending on what it is referring to.) Not for you, this sentence is in current HTML5 Draft 4.2.5.5. That is why I paid attention. Anyway, most of charsets defined RFC 1345 are not clear. Conversion table between [those charsets and] Unicode is needed. Quite. Anne van Kesteren, I and several others are currently trying to document how browsers handle different encodings at http://wiki.whatwg.org/wiki/Web_Encodings, and defining mappings to Unicode is one of the goals. Your contribution would be much appreciated. ICU has a large set of tables which is likely to cover many MS Codepages. (Of course it should be verified) http://bugs.icu-project.org/trac/browser/data/trunk/charset/data/ucm And I have a CP51932 table made from .NET Framework's Converter. http://nkf.sourceforge.jp/ucm/cp51932.ucm -- NARUSE, Yui nar...@airemix.jp
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Thu, Oct 22, 2009 at 9:23 PM, Øistein E. Andersen li...@coq.no wrote: On 22 Oct 2009, at 17:15, NARUSE, Yui wrote: Finally, Why ISO 2022 series is discouraged is not clear. We agree on this point. The string 숍訊昱穿 encoded as ISO-2022-KR is the bytes 0e 3c 73 63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome, when I last checked) will decode it as Windows-1252 and get the string <script>, which is bad. So a site that uses ISO-2022-KR is very likely to expose some users to XSS attacks, which seems like a good reason to discourage that encoding. The same applies to other ISO-2022 encodings. -- Philip Taylor exc...@gmail.com
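The fallback behaviour described above can be reproduced with the byte sequence quoted in the message. This is a minimal sketch, not something from the thread: it assumes Python's built-in `cp1252` codec stands in for a browser's Windows-1252 fallback, and the function name `decode_with_fallback` is made up for illustration.

```python
# The nine bytes quoted in the message: SO (0x0E) followed by four
# two-byte codes whose byte values all fall in the printable ASCII range.
PAYLOAD = bytes.fromhex("0e 3c 73 63 72 69 70 74 3e")


def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> str:
    """Decode `data` the way a UA without ISO-2022-KR support might:
    by treating every byte as a single Windows-1252 character."""
    return data.decode(fallback)


text = decode_with_fallback(PAYLOAD)

if __name__ == "__main__":
    # The shift-out control character survives as U+000E; the remaining
    # eight bytes come out as the ASCII characters they coincide with.
    print(repr(text))
```

The point of the sketch is only that nothing in the byte values distinguishes the Korean text from markup once the shift state is ignored.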
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On 22 Oct 2009, at 22:45, Philip Taylor wrote: On Thu, Oct 22, 2009 at 9:23 PM, Øistein E. Andersen li...@coq.no wrote: On 22 Oct 2009, at 17:15, NARUSE, Yui wrote: Finally, Why ISO 2022 series is discouraged is not clear. We agree on this point. The string 숍訊昱穿 encoded as ISO-2022-KR is the bytes 0e 3c 73 63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome, when I last checked) will decode it as Windows-1252 and get the string <script>, which is bad. [...] If that is the reason, at least HZ encoding would seem to be affected as well. Explicitly discouraging a more or less random subset of the problematic encodings without providing rationale makes it difficult to assess whether or not other, somewhat similar, encodings should be avoided as well, which was the main issue I wanted to raise. -- Øistein E. Andersen
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Wed, 21 Oct 2009, Øistein E. Andersen wrote: ASCII-compatibility: The note in ‘2.1.5 Character encodings’ seems to say that ‘variants of ISO-2022’ (presumably including common ones like ISO-2022-CN, ISO-2022-KR and ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot find anything in Section 2.1.5 that would explain this difference. HZ-GB-2312 uses the byte ASCII uses for ~ as the escape character. ISO-2022-* uses the control codes. That's the difference. Discouraged encodings: ‘4.2.5.5 Specifying the document's character encoding’ advises against certain encodings. In particular: Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC. It is not clear what this means (e.g., the character set JIS_C6226-1983 in any encoding, or only when encoded alone according to RFC1345 as described above); This is talking about character encodings, not character sets. JIS_C6226-1983 is a registered character encoding in the IANA registry. the list of discouraged encodings seems conspicuously short if it is supposed to be complete; and the lack of rationale makes it difficult to understand why these encodings are considered particularly harmful (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two at least initially puzzling cases). The reason for including these is to discourage encodings known to have security issues. I've added HZ-GB-2312, which can be used in a similarly dangerous fashion. (Basically the danger for user agents is in an attacker using an encoding that a user agent could autodetect, while a site interprets the bytes safely; that would allow those encodings to be used to smuggle script elements in a way that a naive whitelisting filter would think is safe.) It might be better to say *why* particular encodings are better avoided, whether or not the list of discouraged encodings be presented as definitive. I've added a note. 
(Incidentally, this advice probably deserves not to be ‘hidden’ in a section nominally reserved for character encoding *declaration* issues.) Yeah. I considered moving it to the Writing HTML documents section, but that one doesn't apply to conformance checkers, so it ends up being more of a pain, since the advice would have to be split into multiple pieces so that it applied appropriately. It's not a big deal. Minor grammar detail in 4.2.5.5: Conformance checkers may advise against authors using legacy encodings. This is ambiguous. It should probably be ‘advise against authors’ using legacy encodings’ or better ‘advise authors against using legacy encodings’. Fixed. On Fri, 23 Oct 2009, NARUSE, Yui wrote: Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC. First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, moreover those correct names as spec are JIS X 0208 and JIS X 0212. On Thu, 22 Oct 2009, Øistein E. Andersen wrote: I am not sure what you mean; they are both listed at http://www.iana.org/assignments/character-sets: Name: JIS_C6226-1983 [RFC1345,KXS2] MIBenum: 63 Source: ECMA registry Alias: iso-ir-87 Alias: x0208 Alias: JIS_X0208-1983 Alias: csISO87JISX0208 Name: JIS_X0212-1990 [RFC1345,KXS2] MIBenum: 98 Source: ECMA registry Alias: x0212 Alias: iso-ir-159 Alias: csISO159JISX02121990 On Fri, 23 Oct 2009, NARUSE, Yui wrote: Where is the word JIS-X-0208 ? Where is the word JIS-X-0212 ? The exact string isn't there, that's why I included the preferred MIME names in brackets in the spec. On Fri, 23 Oct 2009, NARUSE, Yui wrote: Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not ASCII compatible. So they are out of discouraged; mustn't use. You can use non-ASCII-compatible encodings (e.g. UTF-16). Finally, Why ISO 2022 series is discouraged is not clear. Hopefully this is clear now. Anyway, most of charsets defined RFC 1345 are not clear. 
Conversion table between Unicode is needed. On Thu, 22 Oct 2009, Øistein E. Andersen wrote: moreover those correct names as spec are JIS X 0208 and JIS X 0212. (The IANA registry is internally inconsistent and often disagrees with official standards when it comes to capitalisation, dashes/hyphens, underscores and spaces, so it is difficult to get this right. Please excuse me for not always paying due attention to such details in e-mails. Of course, the specifications should follow either IANA or the official standard as appropriate, depending on what it is referring to.) Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not ASCII compatible. So they are out of discouraged; mustn't use. EBCDIC is clearly not ASCII-compatible and may be unique amongst the character sets in the IANA
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On 19 Oct 2009, at 05:52, Ian Hickson wrote: I've noted your e-mail here [...] and moved the whole thing out of the spec. That does not seem to apply to the last part of the original e-mail, quoted below. Øistein E. Andersen Other character encoding issues: ASCII-compatibility: The note in ‘2.1.5 Character encodings’ seems to say that ‘variants of ISO-2022’ (presumably including common ones like ISO-2022-CN, ISO-2022KR and ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot find anything in Section 2.1.5 that would explain this difference. Discouraged encodings: ‘4.2.5.5 Specifying the document's character encoding’ advises against certain encodings. (Incidentally, this advice probably deserves not to be ‘hidden’ in a section nominally reserved for character encoding *declaration* issues.) In particular: Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC. It is not clear what this means (e.g., the character set JIS_C6226-1983 in any encoding, or only when encoded alone according to RFC1345 as described above); the list of discouraged encodings seems conspicuously short if it is supposed to be complete; and the lack of rationale makes it difficult to understand why these encodings are considered particularly harmful (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two at least initially puzzling cases). It might be better to say *why* particular encodings are better avoided, whether or not the list of discouraged encodings be presented as definitive. Minor grammar detail in 4.2.5.5: Conformance checkers may advise against authors using legacy encodings. This is ambiguous. It should probably be ‘advise against authors’ using legacy encodings’ or better ‘advise authors against using legacy encodings’.
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Sat, 18 Jul 2009, Øistein E. Andersen wrote: On 7 Jul 2009, at 09:25, Ian Hickson wrote: On Tue, 9 Jun 2009, Anne van Kesteren wrote: [S]hould HTML5 mention that Windows-932 maps to Windows-31J? (It does not appear in the IANA registry.) I've added this mapping too, just in case. Added x-sjis. What are the other mappings that would be good? Potentially quite a few... The following do not appear in the IANA registry and seem to be supported in IE as well as in at least two of the three browsers Safari, Firefox and Opera. [...] I've noted your e-mail here: http://wiki.whatwg.org/wiki/Web_Encodings#E-mails ...and moved the whole thing out of the spec. I think the conclusion is that we should just do this using IANA aliases. -- Ian Hickson http://ln.hixie.ch/
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On 7 Jul 2009, at 09:25, Ian Hickson wrote: On Tue, 9 Jun 2009, Anne van Kesteren wrote: [S]hould HTML5 mention that Windows-932 maps to Windows-31J? (It does not appear in the IANA registry.) I've added this mapping too, just in case. Added x-sjis. What are the other mappings that would be good? Potentially quite a few... The following do not appear in the IANA registry and seem to be supported in IE as well as in at least two of the three browsers Safari, Firefox and Opera.

Aliases for EUC-CN or GB2312-80, ultimately mapping to GBK:
- EUC-CN
- x-euc-cn
- CN-GB
- csGB231280

Alias for EUC-JP:
- X-EUC-JP

Aliases for Big5:
- cn-big5
- x-x-big5 (already in HTML5)

Aliases for Shift_JIS or Windows-31J (which was originally called Shift_JIS):
- x-sjis (already in HTML5)

Alias for windows-1256:
- cp1256

Name and alias for windows-874 (which does not seem to appear in the IANA registry):
- windows-874
- DOS-874

In addition, the following legacy Macintosh encodings enjoy universal support (IE, Safari, Firefox, Opera), but do not appear in the IANA registry:
- x-mac-icelandic
- x-mac-arabic (somewhat incomplete implementation in IE)
- x-mac-ce (Central-European)
- x-mac-croatian
- x-mac-romanian
- x-mac-cyrillic
- x-mac-ukrainian
- x-mac-greek
- x-mac-turkish

Windows-932 is not supported in IE7 and may not be necessary; others should probably be added if windows-932 is deemed necessary. I've split the table in two to avoid this issue. It looks much better now. (The terminology is perhaps slightly inconsistent, but that can be fixed later.) Earlier, you wrote: GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80, [...]. GBK, on the other hand, is an encoding. As far as I can tell, GB2312 and GB_2312-80 are two different encodings according to IANA. Indeed. 
The following CJK character sets are listed as encodings in the IANA registry:
- JIS_C6226-1978
- JIS_C6226-1983
- JIS_X0212-1990
- GB_2312-80
- KS_C_5601-1987

All these character sets are defined as a 94x94 matrix with rows and columns numbered from 1 to 94 (inclusive). According to RFC1345, a character is to be encoded as the two-byte sequence (row number + 32), (column number + 32) in the eponymous encoding. (The two-byte sequences are thus the same as in an ISO-2022 encoding, but only one character set is available, and there are no escape sequences or anything remotely similar.) In addition, GB_2312, which is really GB_2312-80 with the year omitted, has been defined as what is properly known as EUC-CN. JIS_C6226-1978, JIS_C6226-1983 and JIS_X0212-1990 do not seem to be supported in browsers at all. Both GB_2312-80 and GB_2312 are taken to mean GBK, which is a superset of EUC-CN. KS_C_5601-1987 is taken to mean windows-949, a superset of EUC-KR, in Safari, Firefox and Opera (IE treats it as the union of windows-949 and ISO-2022-KR, which may or may not be needed for compatibility). This is all quite confusing, and what is called GB_2312 in IANA really should be renamed to EUC-CN (keeping GB_2312 as an alias). The HTML5 tables are now technically correct (provided that the encoding names be interpreted strictly according to the IANA registry). Very minor detail: The capitalisation of Windows/windows is inconsistent in the IANA registry; you would have to write, e.g., windows-932 and Windows-31J to follow IANA. Other character encoding issues: ASCII-compatibility: The note in ‘2.1.5 Character encodings’ seems to say that ‘variants of ISO-2022’ (presumably including common ones like ISO-2022-CN, ISO-2022-KR and ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot find anything in Section 2.1.5 that would explain this difference. Discouraged encodings: ‘4.2.5.5 Specifying the document's character encoding’ advises against certain encodings. 
(Incidentally, this advice probably deserves not to be ‘hidden’ in a section nominally reserved for character encoding *declaration* issues.) In particular: Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC. It is not clear what this means (e.g., the character set JIS_C6226-1983 in any encoding, or only when encoded alone according to RFC1345 as described above); the list of discouraged encodings seems conspicuously short if it is supposed to be complete; and the lack of rationale makes it difficult to understand why these encodings are considered particularly harmful (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two at least initially puzzling cases). It might be better to say *why* particular encodings are better avoided, whether or not the list of discouraged encodings be presented as definitive. Minor grammar detail in 4.2.5.5: Conformance checkers may advise against authors
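The RFC 1345 rule described in the message above (row number + 32, column number + 32, on a 94x94 matrix) is simple enough to write down. The following is a minimal sketch under that description only; the function names `rfc1345_bytes` and `rfc1345_cell` are hypothetical, and no character tables are involved, just the arithmetic.

```python
from typing import Tuple


def rfc1345_bytes(row: int, col: int) -> bytes:
    """Encode a cell of a 94x94 character set as the two-byte sequence
    (row + 32), (col + 32), as the message describes for RFC 1345."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")
    return bytes([row + 32, col + 32])


def rfc1345_cell(pair: bytes) -> Tuple[int, int]:
    """Inverse of rfc1345_bytes: recover (row, col) from two bytes."""
    if len(pair) != 2 or not all(0x21 <= b <= 0x7E for b in pair):
        raise ValueError("expected two bytes in the range 0x21..0x7E")
    return pair[0] - 32, pair[1] - 32


if __name__ == "__main__":
    # Every encoded byte lands in 0x21..0x7E, i.e. printable ASCII.
    print(rfc1345_bytes(1, 1), rfc1345_bytes(94, 94))
    print(rfc1345_cell(b"<s"))
```

Because both bytes always fall in the printable ASCII range, a document in such an encoding is indistinguishable from ASCII text at the byte level, which is why these encodings come up in the discussion of ASCII compatibility.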
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Tue, 9 Jun 2009, Anne van Kesteren wrote: On Tue, 09 Jun 2009 01:42:57 +0200, Øistein E. Andersen li...@coq.no wrote: On 5 Jun 2009, Anne van Kesteren wrote: Is the implication here that Shift_JIS and Shift-JIS are distinct [...]? No, Shift-JIS and Windows-932 are commonly used names/labels for the encodings that are registered as Shift_JIS and Windows-31J (respectively) in the IANA charset registry. Sorry for the confusion caused. So should HTML5 mention that Windows-932 maps to Windows-31J? (It does not appear in the IANA registry.) I've added this mapping too, just in case. On Tue, 9 Jun 2009, Øistein E. Andersen wrote: That is an interesting question. My (apparently wrong) understanding was that the table was merely supposed to provide mappings between encodings, since such mappings are inappropriate in non-HTML contexts and cannot be added to the IANA registry. It might be useful to include a set of MIME charset strings which cannot be or have not yet been registered (e.g., x-x-big5, x-sjis, windows-932) as well as information on how CJK character sets are implemented in practice, both of which seem to be necessary for compatibility. Such information does not fit comfortably in the current table, though. Added x-sjis. What are the other mappings that would be good? On Tue, 9 Jun 2009, Øistein E. Andersen wrote: I believe you misunderstand the purpose of this table. The idea is to give a mapping of _labels_ to encodings, not encodings to encodings. I've clarified the text to this effect. You seem to have added "specified by a label" to the phrase which now reads "an encoding specified by a label given in the first column of the following table" without changing the column heading ("Input encoding") and without defining what a label actually is. The reference to "encoding aliasing" is also intact, which seems misleading if the table is not supposed to map between encodings. I've split the table in two to avoid this issue. 
Earlier, you wrote: GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80, [...]. GBK, on the other hand, is an encoding. As far as I can tell, GB2312 and GB_2312-80 are two different encodings according to IANA. On Wed, 10 Jun 2009, Anne van Kesteren wrote: I would prefer them being added to the IANA registry. I've noted that I should do that. -- Ian Hickson http://ln.hixie.ch/
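The distinction drawn in this exchange, a table from labels to encodings rather than from encodings to encodings, can be sketched as a plain lookup. This is an illustration only, not the spec's table: the entries are limited to labels named in the thread, the target chosen for x-sjis (the thread says "Shift_JIS or Windows-31J") is an assumption, and so are case-insensitive matching and the name `resolve_label`.

```python
from typing import Optional

# Keys are labels as they might appear in a charset declaration,
# values are the encodings they are taken to denote.
LABEL_TO_ENCODING = {
    "shift-jis": "Shift_JIS",      # common unregistered spelling
    "x-sjis": "Shift_JIS",         # assumption: could equally be Windows-31J
    "windows-932": "Windows-31J",  # not in the IANA registry per the thread
    "x-x-big5": "Big5",
}


def resolve_label(label: str) -> Optional[str]:
    """Map a declared label to an encoding name, or None if unknown.
    Matching ignores case and surrounding whitespace (an assumption)."""
    return LABEL_TO_ENCODING.get(label.strip().lower())


if __name__ == "__main__":
    for declared in ("Windows-932", "x-x-big5", "latin1"):
        print(declared, "->", resolve_label(declared))
```

Keeping labels and encodings in separate columns of one lookup avoids the confusion raised above, where a row could be read as one encoding being silently replaced by another.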
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On 10 Jun 2009, at 09:06, Anne van Kesteren wrote: It is about adding aliases. If the alias added is also a distinct encoding conformance checkers are supposed to report on the differences. That probably has to be made more explicit, then. Personally I would be happy with making the aliases normative everywhere but I suspect that is not going to fly. E.g. letting US-ASCII always map to Windows-1252 would probably be highly controversial. That particular mapping may not actually be necessary (IE8 maps 8-bit US-ASCII to U+FFFD, and several previous versions of IE ignore the high bit), so making the other aliases normative still seems worth considering. There are a few aliases whose name starts with x-, though. I would prefer them being added to the IANA registry. Sure. It might be useful to include a set of MIME charset strings which cannot be or have not yet been registered (e.g., x-x-big5, x-sjis, windows-932) as well as information on how CJK character sets are implemented in practice, both of which seem to be necessary for compatibility. Such information should definitely be included, yes. In that case, it would probably be less confusing and more accurate to have one table mapping between encodings (or from preferred MIME name to encoding or something along those lines) and another table adding additional MIME charset strings. Since you seem to have studied this subject a lot, do you keep more detailed information somewhere including tests, findings, tables, etc? It would be very cool to have that. Most of the relevant findings have been sent to the WhatWG list as part of the current thread. 
The following messages contain links to tables and tests: http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-March/014190.html http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-July/015455.html http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-April/019322.html Some of the tables and tests may be difficult to interpret, so please feel free to ask if you have any questions. -- Øistein E. Andersen
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On 3 June 09 at 23:19, Ian Hickson wrote: On Tue, 14 Apr 2009, Øistein E. Andersen wrote: HTML5 currently contains a table of encodings aliases, [...] GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80, [...]. GBK, on the other hand, is an encoding. [...] There is a large number of unregistered charset strings, however, and the other mappings in this table are between encodings. Unless x-x-big5 is actually supposed to refer to an encoding distinct from Big5, [this mapping] should be removed. [...]

I believe you misunderstand the purpose of this table. The idea is to give a mapping of _labels_ to encodings, not encodings to encodings. I've clarified the text to this effect.

You seem to have added "specified by a label" to the phrase which now reads "an encoding specified by a label given in the first column of the following table" without changing the column heading ("Input encoding") and without defining what a label actually is. The reference to "encoding aliasing" is also intact, which seems misleading if the table is not supposed to map between encodings. The concept of "misinterpret[ation] for compatibility" seems inappropriate for the mapping from x-x-big5 to Big5 unless the label x-x-big5 is actually supposed to specify an encoding distinct from Big5.

It is not at all clear to me what you mean by "label". It might be the MIME charset string with which the HTML document is labelled, but that would require an inordinate number of strings to be specified (e.g., iso-ir-100, latin1 and IBM819 amongst others alongside ISO-8859-1), so this cannot possibly be the intended meaning.
It might be a normalised form of the MIME charset string, using the IANA charset registry to map an alias to its corresponding name (or to the alias qualified as preferred MIME name if there is such an entry), but that does not quite seem to work either, since aliases not registered in the IANA charset registry would then not be covered by the aliasing mechanism (e.g., it would cause content labelled as x-sjis to be handled as unaugmented Shift_JIS despite the mapping from Shift_JIS to Windows-31J, since x-sjis does not and cannot figure in the IANA charset registry).

I did indeed believe that the table was supposed to map between encodings, and this interpretation still seems to give the correct result in practice for non-CJK encodings (unless, of course, content labelled TIS-620-2533 should actually be interpreted as TIS-620 rather than windows-874).

On 9 June 09 at 10:55, Anne van Kesteren wrote: On Tue, 09 Jun 2009 01:42:57 +0200, Øistein E. Andersen wrote: Shift-JIS and Windows-932 are commonly used names/labels for the encodings that are registered as Shift_JIS and Windows-31J (respectively) in the IANA charset registry. [...]

So should HTML5 mention that Windows-932 maps to Windows-31J? (It does not appear in the IANA registry.)

That is an interesting question. My (apparently wrong) understanding was that the table was merely supposed to provide mappings between encodings, since such mappings are inappropriate in non-HTML contexts and cannot be added to the IANA registry.

It might be useful to include a set of MIME charset strings which cannot be or have not yet been registered (e.g., x-x-big5, x-sjis, windows-932) as well as information on how CJK character sets are implemented in practice, both of which seem to be necessary for compatibility. Such information does not fit comfortably in the current table, though.

-- Øistein E. Andersen
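To make the two readings of "label" concrete, here is a minimal sketch of a two-stage lookup: first normalise the charset string to a registered name, then apply the HTML-specific superset mapping. Everything in it is an assumption made for illustration: the table contents are a hand-picked excerpt of names mentioned in this thread, and `REGISTERED_ALIASES`, `EXTRA_LABELS`, `SUPERSETS` and `resolve` are hypothetical names, not taken from the spec or from the IANA registry.

```python
# Hypothetical excerpt: registered alias (lower-cased) -> preferred name.
REGISTERED_ALIASES = {
    "iso-8859-1": "ISO-8859-1",
    "iso-ir-100": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "ibm819": "ISO-8859-1",
    "shift_jis": "Shift_JIS",
    "euc-kr": "EUC-KR",
}

# Labels mentioned in the thread that are not (or cannot be) registered.
EXTRA_LABELS = {
    "x-sjis": "Shift_JIS",
    "windows-932": "Windows-31J",
    "x-x-big5": "Big5",
}

# Encoding -> superset encoding, as discussed in the thread.
SUPERSETS = {
    "Shift_JIS": "Windows-31J",
    "EUC-KR": "Windows-949",
}


def resolve(label, use_extra_labels=True):
    """Return the encoding to decode with, or None if the label is unknown."""
    key = label.strip().lower()
    name = REGISTERED_ALIASES.get(key)
    if name is None and use_extra_labels:
        name = EXTRA_LABELS.get(key)
    if name is None:
        return None
    # Second stage: replace a subset encoding by its superset, if any.
    return SUPERSETS.get(name, name)
```

With `use_extra_labels=False` the sketch reproduces the problem described above: `x-sjis` is not found in the registry excerpt, so the Shift_JIS to Windows-31J mapping is never reached for it.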
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Tue, 14 Apr 2009, Øistein E. Andersen wrote: Shift_JIS < Windows-31J [...] Shift-JIS < Windows-932

On 5 June 09, Anne van Kesteren wrote: Is the implication here that Shift_JIS and Shift-JIS are distinct [...]?

No, Shift-JIS and Windows-932 are commonly used names/labels for the encodings that are registered as Shift_JIS and Windows-31J (respectively) in the IANA charset registry. Sorry for the confusion caused.

-- Øistein E. Andersen

PS: Sorry for the belated reply, partly caused by a hard-drive breakdown while I was away.
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Is the implication here that Shift_JIS and Shift-JIS are distinct despite the encoding matching rules in Unicode not allowing for that? If that is the case I think we need new matching rules. If the implication is something else I'd like to know.

On Thu, 04 Jun 2009 00:19:05 +0200, Ian Hickson i...@hixie.ch wrote: On Tue, 14 Apr 2009, Øistein E. Andersen wrote: [...] In addition, Shift_JIS < Windows-31J, and all browsers implement this mapping, so the following should be added:

Shift_JIS -> Windows-31J

Added.

[...]

Shift-JIS encoding for Japanese
===============================

Shift-JIS supports:
- ASCII
- Katakana
- JIS X 0208-1990/1997

All browsers furthermore support NEC symbols as well as IBM extensions in both NEC and IBM (Shift-JIS) positions. This is actually Windows-932:

Shift-JIS < Windows-932

[...]

-- Anne van Kesteren http://annevankesteren.nl/
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Fri, 5 Jun 2009, Anne van Kesteren wrote: Is the implication here that Shift_JIS and Shift-JIS are distinct despite the encoding matching rules in Unicode not allowing for that? If that is the case I think we need new matching rules. If the implication is something else I'd like to know. I don't understand the question. -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Fri, 5 Jun 2009, Anne van Kesteren wrote: On Fri, 05 Jun 2009 10:14:46 +0200, Ian Hickson i...@hixie.ch wrote: On Fri, 5 Jun 2009, Anne van Kesteren wrote: Is the implication here that Shift_JIS and Shift-JIS are distinct despite the encoding matching rules in Unicode not allowing for that? If that is the case I think we need new matching rules. If the implication is something else I'd like to know.

I don't understand the question.

Part of my email was the data that Shift_JIS supposedly is a subset of Windows-31J and Shift-JIS supposedly is a subset of Windows-932. (Note the dash versus underscore.)

Ah, ok. I thought you were referring to the change I made to the spec. My apologies.

-- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
I haven't made any changes to the spec based on the feedback below. Let me know if there's anything I missed. I'm not aware of any specific problems at this time.

On Sat, 11 Apr 2009, Øistein E. Andersen wrote: On 22 May 2008, at 12:40, Ian Hickson wrote: Do you have input on the EUC-JP issue?

I am now about to finish my analysis of CJK encodings (e-mail forthcoming), including EUC-JP. This encoding does not seem to be particularly problematic, however. Are you referring to a specific problem?

On Thu, 13 Mar 2008, Øistein E. Andersen wrote: Note: Similarly, IE apparently handles CS-ISO-2022-JP as distinct from ISO-2022-JP. This is something to keep in mind when looking at multi-byte encodings.

What should we say about this?

The issue seems to be that IE's implementation of ISO-2022-JP is a large superset of what is actually specified. (This is the case for several other CJK encodings as well.) See forthcoming e-mail for an actual description of the extensions.

(TC)VN5712-2 < (TC)VN5712-1
Opera[?] and Firefox seem to have implemented the superset only.

Should we require this mapping?

For reference: (TC)VN5712-3 < (TC)VN5712-2 = VSCII-2 = ISO IR 180 < (TC)VN5712-1

Only the complete set seems to be implemented (and only in Firefox), and MIME charset strings referring to one of the subsets do not seem to work at all, so no mappings are necessary.

-- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Sun, 12 Apr 2009, Øistein E. Andersen wrote: On 2 Sep 2008, at 06:06, Ian Hickson wrote: On Wed, 30 Jul 2008, Øistein E. Andersen wrote: 1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252. IE7, on the other hand, simply ignores the high bit (as it does for a few other 7-bit encodings, by the way). Perhaps this alias could be dropped from the other browsers. Ignoring the high bit seems like a dangerous security bug; dropping any character with a high bit as U+FFFD seems unnecessarily drastic. According to a test I did using browsershots.org, IE8 actually seems to do this (8-bit characters are rendered as squares), which looks like an argument in favour of the more `drastic' option. I've made the spec go with the O/F/S behaviour here. This has the advantage of not adding ASCII as a separate encoding, and Windows-1252 is presumably one of the encodings most often mislabelled as ASCII. However, IE has ignored the high bit at least since 5.01 (IE4 via browsershots.org treats it as CP1252, but this could well be locale-dependent), so there may not be that many mislabelled pages. Has anyone got a list of pages which are labelled as ASCII and contain 8-bit characters? This is probably not very important. U+FFFD is `purer', Windows-1252 has the potential of rescuing a few pages. It is however essential that 8-bit characters be considered not conforming since they do not in fact work (as Windows-1252 bytes) in IE5-IE8. This is currently the case, but I think Henri Sivonen has argued that `misinterpretation for compatibility' should not be considered a conformance error (which would probably be fairly harmless for other mappings). I (and the spec) agree with you here, that these should be reported as errors. -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
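The three treatments of 8-bit bytes in content labelled US-ASCII that are compared above can be illustrated with a short sketch. The function names are hypothetical, and Python's `cp1252` and `ascii` codecs stand in for what a browser would do; this says nothing about how any browser actually implements it.

```python
def as_windows_1252(data: bytes) -> str:
    """Opera/Firefox/Safari behaviour as described: treat as Windows-1252."""
    return data.decode("cp1252", errors="replace")


def ignoring_high_bit(data: bytes) -> str:
    """IE5-IE7 behaviour as described: clear bit 7 of every byte."""
    return bytes(b & 0x7F for b in data).decode("ascii")


def high_bytes_to_fffd(data: bytes) -> str:
    """IE8 behaviour as described: every byte >= 0x80 becomes U+FFFD."""
    return data.decode("ascii", errors="replace")
```

The security concern about ignoring the high bit is visible with bytes such as 0xBC and 0xBE, which turn into `<` and `>` once bit 7 is cleared, so `ignoring_high_bit(b"\xbcscript\xbe")` yields markup that was not present in the byte stream as ASCII.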
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Tue, 14 Apr 2009, Øistein E. Andersen wrote:

This e-mail is an attempt to give a relatively concise yet reasonably complete overview of non-Unicode character sets and encodings for `Chinese characters', excluding those which are not supported by at least one of the four browsers IE, Safari, Firefox and Opera (henceforth `all browsers'), and tentatively avoiding technical details which are out of scope for HTML5 unless they are important to gain a general understanding of the relevant issues.

To avoid unnecessary confusion, the following three concepts are kept distinct:

1) Character set: A collection of characters, typically defined as a matrix with 94 rows and 94 columns. (A character set with more than one matrix is said to have multiple planes.) The ones officially registered `for use with escape sequences' (typically in ISO-2022 encodings, see below) can be found at http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm.

2) Encoding: Defines how a given character (typically defined by its row and column numbers) from a given character set can be encoded as a sequence of bytes. All the encodings discussed below allow multiple character sets to be encoded. [ISO-2022 encodings use only 7-bit bytes and employ escape sequences to switch between different character sets. EUC encodings use bytes < 128 for ASCII (or something similar) and bytes >= 128 to encode other character sets.]

3) MIME charset string: This is the string used, e.g., in a HTTP Content-Type header to indicate the *encoding*. Many of these can be found at http://www.iana.org/assignments/character-sets.

Some information about browser support for specific character sets, encodings and MIME charset strings can be found at http://coq.no/character-tables/mime/iso-2022/en, http://coq.no/character-tables/mime/euc/en and http://coq.no/character-tables/mime/locale-specific/en.

The notation a < b means that a is a proper subset of b; a and b can be either character sets or encodings.

******************************************
* What should HTML 5 say about all this? *
******************************************

This section gives a summary of superset encodings which are either universally supported or potentially needed for compatibility. (Anyone who is going to read the entire e-mail will probably prefer to read the sections *Chinese*, *Japanese* and *Korean* at this point and return to this section afterwards.)

Superset encodings (stricto sensu)
----------------------------------

HTML5 currently contains a table of encoding aliases, of which the following involve Chinese characters:

1) EUC-KR -> Windows-949
2) GB2312 -> GBK
3) GB_2312-80 -> GBK
4) KS_C_5601-1987 -> Windows-949
5) x-x-big5 -> Big5

EUC-KR < Windows-949, and all browsers do 1), so this is reasonable and probably needed.

GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80, which can be expressed not only in EUC-CN encoding, but also in ISO-2022-CN encoding and HZ encoding. GBK, on the other hand, is an encoding. EUC-CN < GBK. It would be more correct to remove 2) and 3) and instead add:

EUC-CN -> GBK

Admittedly, EUC-CN is sometimes called `8-bit GB encoding', and registered MIME charset strings include GB2312 and GB_2312-80 as distinct entries (but not EUC-CN), so a note to this effect might be appropriate. (Additionally, GBK is slightly ambiguous, so make sure not to reference an incomplete or outdated version without pointing out necessary amendments/additions.)

Similarly, EUC-KR is sometimes referred to as `eight-bit KS' or `KS_C_5601-1987', which Ken Lunde characterises as `incorrect and dangerous' in his book /CJKV Information Processing/. It would be more correct to remove 4). Unlike EUC-CN, EUC-KR is a registered MIME charset string, but KS_C_5601-1987 has a distinct entry, so a note might again be appropriate.

As for 5), the MIME charset string x-x-big5 does indeed correspond to Big5 encoding (or rather an extension thereof) in all browsers but Opera. There is a large number of unregistered charset strings, however, and the other mappings in this table are between encodings. Unless x-x-big5 is actually supposed to refer to an encoding distinct from Big5, 5) should be removed. Instead (depending on the reference ultimately given for Big5), it may be necessary to note that at least certain ETen extensions should be regarded as part of Big5.

I believe you misunderstand the purpose of this table. The idea is to give a mapping of _labels_ to encodings, not encodings to encodings. I've clarified the text to this effect.

In addition, Shift_JIS < Windows-31J, and all browsers implement this mapping, so the following should be added:

Shift_JIS -> Windows-31J

Added.

I haven't added the mappings described below, since they are not all implemented uniformly. If specific mappings are
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
This e-mail is an attempt to give a relatively concise yet reasonably complete overview of non-Unicode character sets and encodings for `Chinese characters', excluding those which are not supported by at least one of the four browsers IE, Safari, Firefox and Opera (henceforth `all browsers'), and tentatively avoiding technical details which are out of scope for HTML5 unless they are important to gain a general understanding of the relevant issues.

To avoid unnecessary confusion, the following three concepts are kept distinct:

1) Character set: A collection of characters, typically defined as a matrix with 94 rows and 94 columns. (A character set with more than one matrix is said to have multiple planes.) The ones officially registered `for use with escape sequences' (typically in ISO-2022 encodings, see below) can be found at http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm .

2) Encoding: Defines how a given character (typically defined by its row and column numbers) from a given character set can be encoded as a sequence of bytes. All the encodings discussed below allow multiple character sets to be encoded. [ISO-2022 encodings use only 7-bit bytes and employ escape sequences to switch between different character sets. EUC encodings use bytes < 128 for ASCII (or something similar) and bytes >= 128 to encode other character sets.]

3) MIME charset string: This is the string used, e.g., in a HTTP Content-Type header to indicate the *encoding*. Many of these can be found at http://www.iana.org/assignments/character-sets.

Some information about browser support for specific character sets, encodings and MIME charset strings can be found at http://coq.no/character-tables/mime/iso-2022/en , http://coq.no/character-tables/mime/euc/en and http://coq.no/character-tables/mime/locale-specific/en .

The notation a < b means that a is a proper subset of b; a and b can be either character sets or encodings.

******************************************
* What should HTML 5 say about all this? *
******************************************

This section gives a summary of superset encodings which are either universally supported or potentially needed for compatibility. (Anyone who is going to read the entire e-mail will probably prefer to read the sections *Chinese*, *Japanese* and *Korean* at this point and return to this section afterwards.)

Superset encodings (stricto sensu)
----------------------------------

HTML5 currently contains a table of encoding aliases, of which the following involve Chinese characters:

1) EUC-KR -> Windows-949
2) GB2312 -> GBK
3) GB_2312-80 -> GBK
4) KS_C_5601-1987 -> Windows-949
5) x-x-big5 -> Big5

EUC-KR < Windows-949, and all browsers do 1), so this is reasonable and probably needed.

GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80, which can be expressed not only in EUC-CN encoding, but also in ISO-2022-CN encoding and HZ encoding. GBK, on the other hand, is an encoding. EUC-CN < GBK. It would be more correct to remove 2) and 3) and instead add:

EUC-CN -> GBK

Admittedly, EUC-CN is sometimes called `8-bit GB encoding', and registered MIME charset strings include GB2312 and GB_2312-80 as distinct entries (but not EUC-CN), so a note to this effect might be appropriate. (Additionally, GBK is slightly ambiguous, so make sure not to reference an incomplete or outdated version without pointing out necessary amendments/additions.)

Similarly, EUC-KR is sometimes referred to as `eight-bit KS' or `KS_C_5601-1987', which Ken Lunde characterises as `incorrect and dangerous' in his book /CJKV Information Processing/. It would be more correct to remove 4). Unlike EUC-CN, EUC-KR is a registered MIME charset string, but KS_C_5601-1987 has a distinct entry, so a note might again be appropriate.

As for 5), the MIME charset string x-x-big5 does indeed correspond to Big5 encoding (or rather an extension thereof) in all browsers but Opera. There is a large number of unregistered charset strings, however, and the other mappings in this table are between encodings. Unless x-x-big5 is actually supposed to refer to an encoding distinct from Big5, 5) should be removed. Instead (depending on the reference ultimately given for Big5), it may be necessary to note that at least certain ETen extensions should be regarded as part of Big5.

In addition, Shift_JIS < Windows-31J, and all browsers implement this mapping, so the following should be added:

Shift_JIS -> Windows-31J

Further superset encodings (probably not needed)
------------------------------------------------

ISO-2022-CN < ISO-2022-CN-EXT

This is reasonable, but probably not necessary: Firefox does it, Safari does not, Opera does not implement the superset, IE does not even implement the subset. Distinguishing between them is pointless.
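The subset relations listed above can be probed with a short sketch. It assumes that Python's built-in codecs `euc_kr`/`cp949`, `gb2312`/`gbk` and `shift_jis`/`cp932` are acceptable stand-ins for EUC-KR/Windows-949, EUC-CN/GBK and Shift_JIS/Windows-31J; these are CPython's tables, not any browser's, and the helper names are hypothetical.

```python
def can_encode(text: str, codec: str) -> bool:
    """True if every character of text exists in the codec's repertoire."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def can_decode(data: bytes, codec: str) -> bool:
    """True if data is a valid byte sequence in the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


def same_bytes(text: str, subset: str, superset: str) -> bool:
    """True if both codecs encode text to the same byte sequence."""
    return text.encode(subset) == text.encode(superset)
```

For text inside the smaller repertoire, `same_bytes` should hold for each pair, which is what makes decoding with the superset harmless. Characters or byte sequences that exist only in the extension (a traditional character for GBK, a circled digit for Windows-31J, a lead byte 0x81 for Windows-949) are accepted by the superset codec and rejected by the subset codec.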
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On 2 Sep 2008, at 06:06, Ian Hickson wrote: On Wed, 30 Jul 2008, Øistein E. Andersen wrote: 1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252. IE7, on the other hand, simply ignores the high bit (as it does for a few other 7-bit encodings, by the way). Perhaps this alias could be dropped from the other browsers. Ignoring the high bit seems like a dangerous security bug; dropping any character with a high bit as U+FFFD seems unnecessarily drastic. According to a test I did using browsershots.org, IE8 actually seems to do this (8-bit characters are rendered as squares), which looks like an argument in favour of the more `drastic' option. I've made the spec go with the O/F/S behaviour here. This has the advantage of not adding ASCII as a separate encoding, and Windows-1252 is presumably one of the encodings most often mislabelled as ASCII. However, IE has ignored the high bit at least since 5.01 (IE4 via browsershots.org treats it as CP1252, but this could well be locale-dependent), so there may not be that many mislabelled pages. Has anyone got a list of pages which are labelled as ASCII and contain 8-bit characters? This is probably not very important. U+FFFD is `purer', Windows-1252 has the potential of rescuing a few pages. It is however essential that 8-bit characters be considered not conforming since they do not in fact work (as Windows-1252 bytes) in IE5-IE8. This is currently the case, but I think Henri Sivonen has argued that `misinterpretation for compatibility' should not be considered a conformance error (which would probably be fairly harmless for other mappings). 4. Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite inconsistently; [...] I think the HTML5 spec does what is necessary here, but it may be that the encodings specs are vague still. 
[For the record, HTML5 currently requires delete and C1 characters (as well as C0 save white space) to be replaced by U+FFFD during `pre-processing of the input stream', which effectively circumvents the problem that character encoding specifications treat this range in a vague and inconsistent manner.]

-- Øistein E. Andersen
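A minimal sketch of the replacement step described in the bracketed note, applied to already-decoded text. The exact set of exempt white-space characters (tab, line feed, form feed, carriage return) is an assumption here, as is the function name `preprocess`; the spec text of the time is the authority, not this sketch.

```python
import re

# C0 controls except tab, LF, FF and CR (assumed white-space set),
# plus delete (U+007F) and the C1 range (U+0080..U+009F).
_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0e-\x1f\x7f-\x9f]")


def preprocess(text: str) -> str:
    """Replace disallowed control characters by U+FFFD."""
    return _CONTROLS.sub("\ufffd", text)
```

Because the substitution happens after decoding, it does not matter whether a given encoding specification assigns controls to 0x80--0x9F or leaves them undefined: whatever ends up in that Unicode range is replaced.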
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On 22 May 2008, at 12:40, Ian Hickson wrote: Do you have input on the EUC-JP issue?

I am now about to finish my analysis of CJK encodings (e-mail forthcoming), including EUC-JP. This encoding does not seem to be particularly problematic, however. Are you referring to a specific problem?

On Thu, 13 Mar 2008, Øistein E. Andersen wrote: Note: Similarly, IE apparently handles CS-ISO-2022-JP as distinct from ISO-2022-JP. This is something to keep in mind when looking at multi-byte encodings.

What should we say about this?

The issue seems to be that IE's implementation of ISO-2022-JP is a large superset of what is actually specified. (This is the case for several other CJK encodings as well.) See forthcoming e-mail for an actual description of the extensions.

(TC)VN5712-2 < (TC)VN5712-1
Opera[?] and Firefox seem to have implemented the superset only.

Should we require this mapping?

For reference: (TC)VN5712-3 < (TC)VN5712-2 = VSCII-2 = ISO IR 180 < (TC)VN5712-1

Only the complete set seems to be implemented (and only in Firefox), and MIME charset strings referring to one of the subsets do not seem to work at all, so no mappings are necessary.

-- Øistein E. Andersen
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Wed, 30 Jul 2008, Øistein E. Andersen wrote: The current table seems to cover the mappings between different common compatible 8-bit encodings as implemented in IE7, yes. The table at http://coq.no/character-tables/mime/en gives a bit more detail, most of which is better kept outside HTML5 itself. However, the following observations can be made:

1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252. IE7, on the other hand, simply ignores the high bit (as it does for a few other 7-bit encodings, by the way). Perhaps this alias could be dropped from the other browsers.

Ignoring the high bit seems like a dangerous security bug; dropping any character with a high bit as U+FFFD seems unnecessarily drastic. I've made the spec go with the O/F/S behaviour here.

2. Firefox and Opera seem to sniff for text/plain; charset=ISO-8859-1 (as per HTML5), whereas Safari seems to do the same for text/plain; charset=ISO-8859-11 instead [Version 3.1.2 (5525.20.1)]. Bug?

I believe so.

3. For certain character sets, different browsers map to different, but visually similar Unicode characters. Sometimes, one mapping is old/outdated, but this is not always the case.

Not sure what I can do about that.

4. Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite inconsistently; different browsers do different things for the same encoding, and the same browser gives analogous encodings different treatment. (For the early ISO-8859-* encodings, the IANA registry points to RFC 1345, which effectively maps 0x7F--0x9F to U+7F--U+9F, but does not really seem to regard this feature as an essential part of the character set: "the charset is often coded with both graphical and control character sets. If the coded character set is a 96-character set, it is tabled with the relevant GL set (normally ISO-IR-6) and with ISO 6429 as C0 and C1". As for the Windows-* encodings, Microsoft documentation treats bytes in this range as unassigned unless they are mapped to graphical characters, whereas Microsoft products return the underlying byte value in this case.)

I think the HTML5 spec does what is necessary here, but it may be that the encodings specs are vague still.

5. IE handles KOI8-U as KOI8-RU, whereas Safari does the opposite. The former is probably more reasonable (assuming that letters are more important than line-drawing characters), but neither is actually correct given that the encodings are, strictly speaking, incompatible. This issue will of course look a bit different if it can be shown that documents containing the letter Ў/ў (only in KOI8-RU) are frequently mislabelled as KOI8-U.

I guess we'll see what feedback we get on this when testing begins.

Cheers,
-- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On 22 May 2008, at 12:40, Ian Hickson wrote: would you say that what the spec says now is what browsers implement? What should we change? The current table seems to cover the mappings between different common compatible 8-bit encodings as implemented in IE7, yes. The table at http://coq.no/character-tables/mime/en gives a bit more detail, most of which is better kept outside HTML5 itself. However, the following observations can be made: 1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252. IE7, on the other hand, simply ignores the high bit (as it does for a few other 7-bit encodings, by the way). Perhaps this alias could be dropped from the other browsers. 2. Firefox and Opera seem to sniff for text/plain; charset=ISO-8859-1 (as per HTML5), whereas Safari seems to do the same for text/plain; charset=ISO-8859-11 instead [Version 3.1.2 (5525.20.1)]. Bug? 3. For certain character sets, different browsers map to different, but visually similar Unicode characters. Sometimes, one mapping is old/outdated, but this is not always the case. 4. Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite inconsistently; different browsers do different things for the same encoding, and the same browser gives analogous encodings different treatment. (For the early ISO-8859-* encodings, the IANA registry points to RFC 1345, which effectively maps 0x7F--0x9F to U+7F--U+9F, but does not really seem to regard this feature as an essential part of the character set: the charset is often coded with both graphical and control character sets. If the coded character set is a 96-character set, it is tabled with the relevant GL set (normally ISO-IR-6) and with ISO 6429 as C0 and C1 As for the Windows-* encodings, Microsoft documentation treats bytes in this range as unassigned unless they are mapped to graphical characters, whereas Microsoft products return the underlying byte value in this case.) 5. IE handles KOI8-U as KOI8-RU, whereas Safari does the opposite. 
The former is probably more reasonable (assuming that letters are more important than line-drawing characters), but neither is actually correct given that the encodings are, strictly speaking, incompatible. This issue will of course look a bit different if it can be shown that documents containing the letter Ў/ў (only in KOI8-RU) are frequently mislabelled as KOI8-U. Do you have input on the EUC-JP issue? Not yet, but you can expect some input on CJK encodings at some point in the future. -- Øistein E. Andersen
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Thu, 13 Mar 2008, Øistein E. Andersen wrote: On 5th June 2007, Øistein E. Andersen wrote: (To do this properly, what we really ought to do is look for C1 and undefined characters in all IANA charsets and semi-official mappings to Unicode and check 1) whether the gaps can be filled by borrowing from other encodings, and 2) whether browsers actually do so. [...])

I have finally got round to looking at superset encodings. To do this, I started with Unicode mappings from [UNI] for 8-bit 1-byte alphabet encodings and added mappings for other such encodings implemented in Opera, Safari or Firefox, mostly from [CSETS], though I made one for Windows-Sami-2 from a PDF. (I then discovered that IE had something called Arabic-ASMO, for which no matching specification could be found, and subsequently reverse-engineered all IE's encodings. Most of these turned out to be identical to other mappings or only add characters from the PUA, but some real differences were found, and those are reported in the text below.)

[UNI] http://unicode.org/Public/MAPPINGS/
[CSETS] http://crl.nmsu.edu/~mleisher/csets.html

All the character repertoires and encoding vectors defined by the mappings were then compared pairwise. (Codepoints mapped to C0, space, BS or C1 were treated as unassigned, and directionality indicators for Arabic and Hebrew were ignored.) The result is quite a big and unreadable table [FULL], so the repertoires and encodings were clustered, which gave rise to the tables in [ENC], which compare charsets with fewer than 27 incompatible codepoints, as well as those in [REP], which compare charsets with at most 60 characters not found in both repertoires. (The thresholds are arbitrary, but more than sufficiently large to ensure that all related charsets will be clustered together and at the same time sufficiently small to keep the tables at a reasonable size.)
[FULL] http://coq.no/X/charset-table.html [ENC] http://coq.no/X/charset-enc.html [REP] http://coq.no/X/charset-rep.html A short summary of the most interesting/relevant results (supported by [ENC]) can be found below. This is quite amazing data, thank you. I'm not sure what to do with it, frankly. Given your familiarity with the topic, would you say that what the spec says now is what browsers implement? What should we change? Do you have input on the EUC-JP issue? PS: How should colour be added to tables like these in HTML5 with neither of the attributes bgcolor and style? Class attribute and external stylesheets. (Possibly a data-* attribute.) Note: Similarly, IE apparently handles CS-ISO-2022-JP as distinct from ISO-2022-JP. This is something to keep in mind when looking at multi-byte encodings. What should we say about this? (TC)VN5712-2 (TC)VN5712-1 Opera and Firefox seem to have implemented the superset only. Should we require this mapping? -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
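The pairwise comparison of encoding vectors described in the quoted message can be sketched roughly as below. This is not the script used to produce [FULL], [ENC] or [REP]: it relies on Python's own single-byte codecs instead of the [UNI] and [CSETS] mapping files, it treats C0, space, delete and C1 as unassigned (an assumption about the intended rule), and the function names are hypothetical.

```python
def table(codec):
    """Map each byte 0x00..0xFF to a character, or None if unassigned."""
    out = []
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            ch = None
        # Assumption: C0, space, DEL and C1 count as unassigned.
        if ch is not None and (ord(ch) <= 0x20 or 0x7F <= ord(ch) <= 0x9F):
            ch = None
        out.append(ch)
    return out


def incompatible(a, b):
    """Number of byte values assigned in both codecs but to different characters."""
    return sum(1 for x, y in zip(table(a), table(b))
               if x is not None and y is not None and x != y)


def is_subset(a, b):
    """True if every byte assigned in a maps to the same character in b."""
    return all(x is None or x == y for x, y in zip(table(a), table(b)))
```

Under these assumptions ISO-8859-1 comes out as a subset of Windows-1252 with no incompatible codepoints, whereas ISO-8859-1 and ISO-8859-15 differ in a handful of positions and neither is a subset of the other; a clustering step would then group codecs whose `incompatible` count falls below a chosen threshold.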
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
Krzysztof Żelechowski wrote: Some characters, like digits, are direction-transparent [...] Inserting an LTR mark before them makes them LTR. Thanks. I would have preferred a solution which did not involve inserting extraneous characters, but I have now added LTR marks to fix the rendering. -- Øistein E. Andersen
Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]
On Thu, 13-03-2008 at 02:04 +0100, Øistein E. Andersen wrote: PPS: Some right-to-left characters contaminate surrounding characters as I have not yet found a simple solution to make everything strictly left-to-right (probably because I have not looked for it properly).

Some characters, like digits, are direction-transparent; they inherit direction from the preceding text. Inserting an LTR mark before them makes them LTR.

Chris