Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Øistein E. Andersen Fri, 17 Jul 2009 17:29:14 -0700

On 7 Jul 2009, at 09:25, Ian Hickson wrote:

On Tue, 9 Jun 2009, Anne van Kesteren wrote:

[S]hould HTML5 mention that Windows-932 maps to Windows-31J? (It does
not appear in the IANA registry.)


I've added this mapping too, just in case.

Added x-sjis. What are the other mappings that would be good?

Potentially quite a few... The following do not appear in the IANA registry and seem to be supported in IE as well as in at least two of the three browsers Safari, Firefox and Opera.


Aliases for EUC-CN or GB2312-80, ultimately mapping to GBK:
- EUC-CN
- x-euc-cn
- CN-GB
- csGB231280

Alias for EUC-JP:
- X-EUC-JP

Aliases for Big5:
- cn-big5
- x-x-big5 (already in HTML5)

Aliases for Shift_JIS or Windows-31J (which was originally called Shift_JIS):

- x-sjis (already in HTML5)

Alias for windows-1256:
- cp1256

Name and alias for windows-874 (which does not seem to appear in the IANA registry):

- windows-874
- DOS-874
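
The alias lists above could be captured in a lookup table along the following lines. This is only a sketch: `ALIASES` and `resolve_label` are hypothetical names, the targets simply follow the groupings given in this message, and case-insensitive matching is an assumption.

```python
# Hypothetical alias table built from the lists in this message.
# Keys are lowercased labels; values are the encodings they are taken to mean.
ALIASES = {
    "euc-cn": "GBK",
    "x-euc-cn": "GBK",
    "cn-gb": "GBK",
    "csgb231280": "GBK",
    "x-euc-jp": "EUC-JP",
    "cn-big5": "Big5",
    "x-x-big5": "Big5",
    "x-sjis": "Shift_JIS",  # or Windows-31J, depending on how Shift_JIS is treated
    "cp1256": "windows-1256",
    "windows-874": "windows-874",
    "dos-874": "windows-874",
}


def resolve_label(label):
    """Return the target encoding for a label, or None if it is not listed."""
    # Assumption: labels are matched case-insensitively, ignoring outer whitespace.
    return ALIASES.get(label.strip().lower())
```

For instance, `resolve_label("CN-GB")` would give `"GBK"`, while a label absent from the table gives `None`.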

In addition, the following legacy Macintosh encodings enjoy universal support (IE, Safari, Firefox, Opera), but do not appear in the IANA registry:

- x-mac-icelandic
- x-mac-arabic (somewhat incomplete implementation in IE)
- x-mac-ce (Central-European)
- x-mac-croatian
- x-mac-romanian
- x-mac-cyrillic
- x-mac-ukrainian
- x-mac-greek
- x-mac-turkish

Windows-932 is not supported in IE7 and may not be necessary; others should probably be added if windows-932 is deemed necessary.

I've split the table in two to avoid this issue.

It looks much better now. (The terminology is perhaps slightly inconsistent, but that can be fixed later.)

Earlier, you wrote:


GB2312 and GB_2312-80 technically refer to the *character set* GB
2312-80, [...]. GBK, on the other hand, is an encoding.

As far as I can tell, GB2312 and GB_2312-80 are two different encodings
according to IANA.


Indeed.

The following CJK character sets are listed as encodings in the IANA registry:

- JIS_C6226-1978
- JIS_C6226-1983
- JIS_X0212-1990
- GB_2312-80
- KS_C_5601-1987

All these character sets are defined as a 94x94 matrix with rows and columns numbered from 1 to 94 (inclusive). According to RFC1345, a character is to be encoded as the two-byte sequence (row number + 32), (column number + 32) in the eponymous encoding. (The two-byte sequences are thus the same as in an ISO-2022 encoding, but only one character set is available, and there are no escape sequences or anything remotely similar.)
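
A minimal sketch of the byte mapping just described, assuming nothing beyond the (row + 32), (column + 32) rule; the function names are hypothetical:

```python
def encode_94x94(row, column):
    """Encode a (row, column) position as the two-byte sequence described above."""
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("row and column must both be in the range 1..94")
    return bytes([row + 32, column + 32])


def decode_94x94(data):
    """Recover (row, column) from a two-byte sequence produced by encode_94x94."""
    if len(data) != 2:
        raise ValueError("expected exactly two bytes")
    row, column = data[0] - 32, data[1] - 32
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("bytes are outside the 0x21..0x7E range")
    return row, column
```

Row 16, column 1 thus becomes the bytes 0x30 0x21, and every encoded byte falls in the range 0x21 to 0x7E.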

In addition, GB_2312, which is really GB_2312-80 with the year omitted, has been defined as what is properly known as EUC-CN.
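
For comparison, EUC-CN stores the same row and column numbers with the high bit set, i.e. offset by 160 rather than 32. A small sketch, assuming that Python's `gb2312` codec implements EUC-CN and using the character at row 16, column 1 of GB 2312-80 purely as an illustration (the helper names are hypothetical):

```python
def rfc1345_bytes(row, column):
    """Two-byte form for the bare 94x94 character set: offsets of 32."""
    return bytes([row + 32, column + 32])


def euc_cn_bytes(row, column):
    """Two-byte form in EUC-CN: the same position with the high bit set."""
    return bytes([row + 160, column + 160])


# Row 16, column 1 of GB 2312-80, encoded with Python's EUC-CN codec.
euc = "啊".encode("gb2312")

# Clearing the high bit of each EUC-CN byte gives the bare 94x94 form.
stripped = bytes(b & 0x7F for b in euc)
```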

JIS_C6226-1978, JIS_C6226-1983 and JIS_X0212-1990 do not seem to be supported in browsers at all. Both GB_2312-80 and GB_2312 are taken to mean GBK, which is a superset of EUC-CN. KS_C_5601-1987 is taken to mean windows-949, a superset of EUC-KR, in Safari, Firefox and Opera (IE treats it as the union of windows-949 and ISO-2022-KR, which may or may not be needed for compatibility).
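
The superset relationship between GBK and EUC-CN can be checked roughly as follows. This is a sketch using Python's `gbk` and `gb2312` codecs as stand-ins for what browsers do; the sample characters and the helper name `encodable` are illustrative choices only.

```python
def encodable(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Bytes produced by EUC-CN decode to the same text when read as GBK.
euc_cn_sample = "啊".encode("gb2312")
same_text = euc_cn_sample.decode("gbk") == euc_cn_sample.decode("gb2312")

# A traditional character that GBK covers but GB 2312-80 does not.
gbk_only = "龍"
in_gbk = encodable(gbk_only, "gbk")
in_euc_cn = encodable(gbk_only, "gb2312")
```

Treating a GB_2312 label as GBK therefore decodes genuine EUC-CN content unchanged, while also accepting the additional characters.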

This is all quite confusing, and what is called GB_2312 in IANA really should be renamed to EUC-CN (keeping GB_2312 as an alias). The HTML5 tables are now technically correct (provided that the encoding names be interpreted strictly according to the IANA registry).

Very minor detail: The capitalisation of Windows/windows is inconsistent in the IANA registry; you would have to write, e.g., windows-932 and Windows-31J to follow IANA.



Other character encoding issues:
--------------------------------

ASCII-compatibility:

The note in ‘2.1.5 Character encodings’ seems to say that ‘variants of ISO-2022’ (presumably including common ones like ISO-2022-CN, ISO-2022-KR and ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot find anything in Section 2.1.5 that would explain this difference.
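
To illustrate why the distinction is puzzling: both ISO-2022-JP and HZ-GB-2312 represent non-ASCII characters using only bytes in the range 0x00 to 0x7F, so in either encoding a byte that looks like ASCII does not necessarily stand for the ASCII character. A rough check with Python's `iso2022_jp` and `hz` codecs (the helper `is_seven_bit` is a hypothetical name, and the sample characters are arbitrary):

```python
def is_seven_bit(data):
    """Return True if every byte is in the ASCII range 0x00..0x7F."""
    return all(byte < 0x80 for byte in data)


# One non-ASCII character in each encoding.
iso2022jp = "あ".encode("iso2022_jp")
hz = "啊".encode("hz")

# Both results consist solely of ASCII-range bytes,
# even though neither input character is ASCII.
both_seven_bit = is_seven_bit(iso2022jp) and is_seven_bit(hz)
```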



Discouraged encodings:

‘4.2.5.5 Specifying the document's character encoding’ advises against certain encodings. (Incidentally, this advice probably deserves not to be ‘hidden’ in a section nominally reserved for character encoding *declaration* issues.) In particular:

Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on EBCDIC.

It is not clear what this means (e.g., the character set JIS_C6226-1983 in any encoding, or only when encoded alone according to RFC1345 as described above); the list of discouraged encodings seems conspicuously short if it is supposed to be complete; and the lack of rationale makes it difficult to understand why these encodings are considered particularly harmful (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two at least initially puzzling cases). It might be better to say *why* particular encodings are better avoided, whether or not the list of discouraged encodings be presented as definitive.


Minor grammar detail in 4.2.5.5:

Conformance checkers may advise against authors using legacy encodings.

This is ambiguous. It should probably be ‘advise against authors’ using legacy encodings’ or better ‘advise authors against using legacy encodings’.


--
Øistein E. Andersen
