On 7 Jul 2009, at 09:25, Ian Hickson wrote:

On Tue, 9 Jun 2009, Anne van Kesteren wrote:
[S]hould HTML5 mention that Windows-932 maps to Windows-31J? (It does
not appear in the IANA registry.)

I've added this mapping too, just in case.

Added x-sjis. What are the other mappings that would be good?

Potentially quite a few... The following do not appear in the IANA registry and seem to be supported in IE as well as in at least two of the three browsers Safari, Firefox and Opera.

Aliases for EUC-CN or GB2312-80, ultimately mapping to GBK:
- EUC-CN
- x-euc-cn
- CN-GB
- csGB231280

Alias for EUC-JP:
- X-EUC-JP

Aliases for Big5:
- cn-big5
- x-x-big5 (already in HTML5)

Aliases for Shift_JIS or Windows-31J (which was originally called Shift_JIS):
- x-sjis (already in HTML5)

Alias for windows-1256:
- cp1256

Name and alias for windows-874 (which does not seem to appear in the IANA registry):
- windows-874
- DOS-874
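
For illustration only, the additional labels listed so far could be held in a simple lookup table. This is a minimal Python sketch: the names EXTRA_ALIASES and resolve_label are hypothetical, the case-insensitive matching is an assumption, and the canonical names on the right are one possible reading of the headings above, not text from HTML5 or from any browser.

```python
# Hypothetical alias table built from the lists above. The targets are
# assumptions about what each label should ultimately resolve to.
EXTRA_ALIASES = {
    "euc-cn": "GBK",
    "x-euc-cn": "GBK",
    "cn-gb": "GBK",
    "csgb231280": "GBK",
    "x-euc-jp": "EUC-JP",
    "cn-big5": "Big5",
    "x-x-big5": "Big5",
    "x-sjis": "Shift_JIS",      # or Windows-31J, depending on the final mapping
    "cp1256": "windows-1256",
    "windows-874": "windows-874",
    "dos-874": "windows-874",
}


def resolve_label(label):
    """Return the assumed canonical name for a label, or None if unlisted."""
    # Assumption: labels are compared case-insensitively, ignoring
    # surrounding whitespace.
    return EXTRA_ALIASES.get(label.strip().lower())
```

With this table, resolve_label("X-EUC-JP") gives "EUC-JP", and a label that is not listed gives None.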

In addition, the following legacy Macintosh encodings enjoy universal support (IE, Safari, Firefox, Opera), but do not appear in the IANA registry:
- x-mac-icelandic
- x-mac-arabic (somewhat incomplete implementation in IE)
- x-mac-ce (Central-European)
- x-mac-croatian
- x-mac-romanian
- x-mac-cyrillic
- x-mac-ukrainian
- x-mac-greek
- x-mac-turkish

Windows-932 is not supported in IE7 and may not be necessary; others should probably be added if windows-932 is deemed necessary.


I've split the table in two to avoid this issue.

It looks much better now. (The terminology is perhaps slightly inconsistent, but that can be fixed later.)


Earlier, you wrote:

GB2312 and GB_2312-80 technically refer to the *character set* GB
2312-80, [...]. GBK, on the other hand, is an encoding.

As far as I can tell, GB2312 and GB_2312-80 are two different encodings
according to IANA.

Indeed.

The following CJK character sets are listed as encodings in the IANA registry:
- JIS_C6226-1978
- JIS_C6226-1983
- JIS_X0212-1990
- GB_2312-80
- KS_C_5601-1987

All these character sets are defined as a 94x94 matrix with rows and columns numbered from 1 to 94 (inclusive). According to RFC1345, a character is to be encoded as the two-byte sequence (row number + 32), (column number + 32) in the eponymous encoding. (The two-byte sequences are thus the same as in an ISO-2022 encoding, but only one character set is available, and there are no escape sequences or anything remotely similar.)
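
The rule just described can be sketched in a few lines of Python. The function name rfc1345_bytes is made up for this example, and the range check is an addition that follows from the 94x94 definition.

```python
def rfc1345_bytes(row, column):
    """Encode one cell of a 94x94 character set as (row + 32, column + 32)."""
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("row and column must both be in 1..94")
    return bytes((row + 32, column + 32))


# Row 16, column 1 becomes the byte pair 0x30 0x21.
first_cell = rfc1345_bytes(16, 1)
```

Both bytes always fall in the range 0x21 to 0x7E, i.e., within printable ASCII.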

In addition, GB2312, which is really just the name GB_2312-80 with the underscore and the year omitted, has been defined as what is properly known as EUC-CN.

JIS_C6226-1978, JIS_C6226-1983 and JIS_X0212-1990 do not seem to be supported in browsers at all. Both GB_2312-80 and GB2312 are taken to mean GBK, which is a superset of EUC-CN. KS_C_5601-1987 is taken to mean windows-949, a superset of EUC-KR, in Safari, Firefox and Opera (IE treats it as the union of windows-949 and ISO-2022-KR, which may or may not be needed for compatibility).
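
The superset relation between GBK and EUC-CN can be checked with Python's standard codecs, as a rough sketch. Note the assumptions: Python's codec called "gb2312" is in fact EUC-CN, an EUC-CN byte pair is (row + 0xA0, column + 0xA0), and the helper names euc_cn_bytes and encodable are invented for this example.

```python
def euc_cn_bytes(row, column):
    """EUC-CN form of a GB 2312-80 cell: (row + 0xA0, column + 0xA0)."""
    return bytes((row + 0xA0, column + 0xA0))


def encodable(text, codec):
    """Return True if text can be encoded with the given Python codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


cell = euc_cn_bytes(16, 1)  # row 16, column 1 of GB 2312-80

# Valid EUC-CN bytes decode to the same character under GBK.
same_character = cell.decode("gb2312") == cell.decode("gbk")

# U+570B (a traditional character) is absent from GB 2312-80 but present in GBK.
only_in_gbk = encodable("\u570b", "gbk") and not encodable("\u570b", "gb2312")
```

This only illustrates why treating an EUC-CN label as GBK is harmless for valid EUC-CN data; it says nothing about what any particular browser does.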

This is all quite confusing, and what is called GB2312 in IANA really should be renamed to EUC-CN (keeping GB2312 as an alias). The HTML5 tables are now technically correct (provided that the encoding names be interpreted strictly according to the IANA registry).

Very minor detail: The capitalisation of Windows/windows is inconsistent in the IANA registry; you would have to write, e.g., windows-932 and Windows-31J to follow IANA.


Other character encoding issues:
--------------------------------

ASCII-compatibility:
The note in ‘2.1.5 Character encodings’ seems to say that ‘variants of ISO-2022’ (presumably including common ones like ISO-2022-CN, ISO-2022-KR and ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot find anything in Section 2.1.5 that would explain this difference.
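
To make the comparison concrete, here is a small Python check using the standard "iso2022_jp" and "hz" codecs. It only shows that both encodings carry non-ASCII characters in bytes below 0x80; it does not settle what the note in the draft intends. The helper name is_seven_bit is made up for this example.

```python
def is_seven_bit(text, codec):
    """Return True if every byte of the encoded text is below 0x80."""
    return all(byte < 0x80 for byte in text.encode(codec))


# A hiragana letter in ISO-2022-JP and a hanzi in HZ both come out as
# sequences of bytes in the ASCII range, unlike the same hanzi in EUC-CN.
jp_seven_bit = is_seven_bit("\u3042", "iso2022_jp")
hz_seven_bit = is_seven_bit("\u554a", "hz")
euc_seven_bit = is_seven_bit("\u554a", "gb2312")
```

In both seven-bit cases, a byte in the ASCII range does not necessarily stand for the corresponding ASCII character, which is why the different treatment of the two seems to need an explanation.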


Discouraged encodings:
‘4.2.5.5 Specifying the document's character encoding’ advises against certain encodings. (Incidentally, this advice probably deserves not to be ‘hidden’ in a section nominally reserved for character encoding *declaration* issues.) In particular:

Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC.

It is not clear what this means (e.g., the character set JIS_C6226-1983 in any encoding, or only when encoded alone according to RFC1345 as described above); the list of discouraged encodings seems conspicuously short if it is supposed to be complete; and the lack of rationale makes it difficult to understand why these encodings are considered particularly harmful (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two at least initially puzzling cases). It might be better to say *why* particular encodings are better avoided, whether or not the list of discouraged encodings be presented as definitive.

Minor grammar detail in 4.2.5.5:
Conformance checkers may advise against authors using legacy encodings.

This is ambiguous. It should probably be ‘advise against authors’ using legacy encodings’ or better ‘advise authors against using legacy encodings’.

--
Øistein E. Andersen
