Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-23 Thread NARUSE, Yui


Ian Hickson wrote:
 Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on
 EBCDIC.
 It is not clear what this means (e.g., the character set JIS_C6226-1983 in
 any encoding, or only when encoded alone according to RFC1345 as described
 above); 
 
 This is talking about character encodings, not character sets. 
 JIS_C6226-1983 is a registered character encoding in the IANA registry.

Yes, I can understand this, but...

 On Fri, 23 Oct 2009, NARUSE, Yui wrote:
 Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 
 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based 
 on EBCDIC.
 First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, moreover 
 those correct names as spec are JIS X 0208 and JIS X 0212.
 
 On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
 I am not sure what you mean; they are both listed at
 http://www.iana.org/assignments/character-sets:

 Name: JIS_C6226-1983 [RFC1345,KXS2]
 MIBenum: 63
 Source: ECMA registry
 Alias: iso-ir-87
 Alias: x0208
 Alias: JIS_X0208-1983
 Alias: csISO87JISX0208

 Name: JIS_X0212-1990 [RFC1345,KXS2]
 MIBenum: 98
 Source: ECMA registry
 Alias: x0212
 Alias: iso-ir-159
 Alias: csISO159JISX02121990
 
 On Fri, 23 Oct 2009, NARUSE, Yui wrote:
 Where is the word JIS-X-0208 ?
 Where is the word JIS-X-0212 ?
 
 The exact string isn't there, that's why I included the preferred MIME 
 names in brackets in the spec.

If it is talking about character encodings,
why does it mainly use the names of character sets?
The following seems better.

 Authors should not use JIS_C6226-1983, JIS_X0212-1990,
 encodings based on ISO-2022, and encodings based 

 On Fri, 23 Oct 2009, NARUSE, Yui wrote:
 Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not
 ASCII compatible. So they are out of discouraged; mustn't use.
 
 You can use non-ASCII-compatible encodings (e.g. UTF-16).

I see.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-23 Thread Ian Hickson
On Fri, 23 Oct 2009, NARUSE, Yui wrote:
  
  The exact string isn't there, that's why I included the preferred MIME 
  names in brackets in the spec.
 
 If it is talking about character encodings,
 why does it mainly use the names of character sets?
 The following seems better.

  Authors should not use JIS_C6226-1983, JIS_X0212-1990,
  encodings based on ISO-2022, and encodings based 

Ok, done.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-23 Thread Øistein E . Andersen

On 23 Oct 2009, at 04:20, Ian Hickson wrote:


On Wed, 21 Oct 2009, Øistein E. Andersen wrote:





ASCII-compatibility:
The note in ‘2.1.5 Character encodings’ seems to say that [...]
ISO-2022’[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and
I cannot find anything in Section 2.1.5 that would explain this difference.


HZ-GB-2312 uses the byte ASCII uses for ~ as the escape character.
ISO-2022-* uses the control codes. That's the difference.


'~'/0x7E is not (and should not be, as far as I can tell) relevant for  
HTML5's concept of ASCII compatibility.



Discouraged encodings: [...]


Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
(JIS_X0212-1990), [...]


It is not clear what this means [...]


This is talking about character encodings, not character sets.
JIS_C6226-1983 is a registered character encoding in the IANA  
registry.


(This is less confusing now since HTML5 only deals with character
encodings and the strings match those in the IANA registry as
suggested by Yui Naruse.)



the list of discouraged encodings seems conspicuously short if it is
supposed to be complete; and the lack of rationale makes it difficult to
understand why these encodings are considered particularly harmful
(JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two
at least initially puzzling cases).


The reason for including these is to discourage encodings known to have
security issues. I've added HZ-GB-2312, which can be used in a similarly
dangerous fashion. (Basically the danger for user agents is in an attacker
using an encoding that a user agent could autodetect, while a site
interprets the bytes safely; that would allow those encodings to be used
to smuggle script elements in a way that a naive whitelisting filter
would think is safe.)

It might be better to say *why* particular encodings are better avoided,
whether or not the list of discouraged encodings be presented as
definitive.


I've added a note.

[...]

On Thu, 22 Oct 2009, Philip Taylor wrote:


The string "숍訊昱穿" encoded as ISO-2022-KR is the bytes 0e 3c 73
63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome,
when I last checked) will decode it as Windows-1252 and get the string
"<script>", which is bad. So a site that uses ISO-2022-KR is very likely
to expose some users to XSS attacks, which seems like a good reason to
discourage that encoding. The same applies to other ISO-2022 encodings.


[...]

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:


If that is the reason, at least HZ encoding would seem to be affected as
well. Explicitly discouraging a more or less random subset of the
problematic encodings without providing rationale makes it difficult to
assess whether or not other, somewhat similar, encodings should be
avoided as well, which was the main issue I wanted to raise.


Hopefully this is somewhat addressed now.



The added note certainly helps, but it is vague (does "[m]ost of these
encodings" mean all the encodings mentioned above apart from
UTF-32?) and inaccurate (Philip Taylor's example does not rely on
bugs).


Given that the set of encodings is open-ended, I still think it would  
be preferable to make the rationale (a definition of what makes an  
encoding problematic) primary and mention actual encodings as  
examples. This could give something like the following: Encodings in  
which a series of bytes in the range 0x20..0x7E may encode characters  
other than the corresponding characters in the range U+20..U+7E  
represent a potential security vulnerability since a browser that does  
not support the encoding (or does not support the label used to  
declare the encoding, or does not use the same mechanism to detect the  
encoding of unlabelled content) might end up interpreting technically  
benign plain text content as HTML tags and JavaScript.  In particular,  
this applies to encodings in which the bytes corresponding to  
'<script>' in ASCII may encode a different string. Authors should not
use such encodings, which are known to include [...]. In addition,
authors should not use UTF-32 [...]. Alternatively, fixing the current
note would help and might be sufficient, albeit not ideal.
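
The criterion proposed above can be checked mechanically against any decoder. What follows is only a minimal Python sketch, not part of the suggested spec text: the helper name `script_survives` is hypothetical, the shift sequences are chosen for illustration, and Python's built-in `hz` and `iso2022_kr` codecs are assumed to stand in for a site-side decoder.

```python
def script_survives(codec, prefix=b"", suffix=b""):
    """Return True if the ASCII bytes of "<script>" still decode to "<script>".

    prefix and suffix are optional shift or escape sequences that move a
    stateful decoder into (and back out of) a non-ASCII mode.
    """
    data = prefix + b"<script>" + suffix
    return "<script>" in data.decode(codec, errors="replace")


checks = [
    ("cp1252", b"", b""),                      # single-byte, ASCII-based
    ("utf-8", b"", b""),                       # ASCII-based
    ("utf-16-le", b"", b""),                   # two bytes per code unit
    ("hz", b"~{", b"~}"),                      # "~{" switches to GB 2312 mode
    ("iso2022_kr", b"\x1b$)C\x0e", b"\x0f"),   # designator, then SO ... SI
]

for codec, prefix, suffix in checks:
    print(codec, script_survives(codec, prefix, suffix))
```

An encoding for which the helper returns False in some decoder state is one where bytes in the range 0x20..0x7E can stand for something other than the matching ASCII characters, which is the property the proposed wording singles out.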


I think one has to realise that a comprehensive list of problematic  
encodings is an elusive goal and act accordingly.


--
Øistein E. Andersen


PS: The following sentence makes little sense without (curly) quotes
and apostrophes. In case they disappeared before you read it, please
find it repeated below with (ASCII) quotes and apostrophes:


It should probably be "advise against authors' using legacy encodings"
or better "advise authors against using legacy encodings".


(The current text in the spec is fine.)

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-23 Thread Ian Hickson
On Fri, 23 Oct 2009, Øistein E. Andersen wrote:
 On 23 Oct 2009, at 04:20, Ian Hickson wrote:
  On Wed, 21 Oct 2009, Øistein E. Andersen wrote:
  
   ASCII-compatibility:
   The note in ‘2.1.5 Character encodings’ seems to say that [...]
   ISO-2022’[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and I
   cannot
   find anything in Section 2.1.5 that would explain this difference.
  
  HZ-GB-2312 uses the byte ASCII uses for ~ as the escape character.
  ISO-2022-* uses the control codes. That's the difference.
 
 '~'/0x7E is not (and should not be, as far as I can tell) relevant for HTML5's
 concept of ASCII compatibility.

Good point. Moved the encoding over to the other side.


 The added note certainly helps, but it is vague (does "[m]ost of these 
 encodings" mean all the encodings mentioned above apart from UTF-32?) 
 and inaccurate (Philip Taylor's example does not rely on bugs).
 
 Given that the set of encodings is open-ended, I still think it would be 
 preferable to make the rationale (a definition of what makes an encoding 
 problematic) primary and mention actual encodings as examples. This 
 could give something like the following: Encodings in which a series of 
 bytes in the range 0x20..0x7E may encode characters other than the 
 corresponding characters in the range U+20..U+7E represent a potential 
 security vulnerability since a browser that does not support the 
 encoding (or does not support the label used to declare the encoding, or 
 does not use the same mechanism to detect the encoding of unlabelled 
 content) might end up interpreting technically benign plain text content 
 as HTML tags and JavaScript.  In particular, this applies to encodings 
 in which the bytes corresponding to '<script>' in ASCII may encode a 
 different string. Authors should not use such encodings, which are known 
 to include [...]. In addition, authors should not use UTF-32 [...]. 
 Alternatively, fixing the current note would help and might be 
 sufficient, albeit not ideal.

I've reworded the spec based on your suggestion. Thanks!


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-22 Thread NARUSE, Yui
Øistein E. Andersen wrote:
 Discouraged encodings:
 ‘4.2.5.5 Specifying the document's character encoding’ advises against
 certain encodings.  (Incidentally, this advice probably deserves not
 to be ‘hidden’ in a section nominally reserved for character encoding
 *declaration* issues.)  In particular:

 Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
 (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on
 EBCDIC.

First, JIS-X-0208 and JIS-X-0212 are not in the IANA charset registry;
moreover, the correct names according to the standards are JIS X 0208 and JIS X 0212.

Second, JIS_C6226-1983, JIS_X0212-1990, and the EBCDIC encodings are not
ASCII-compatible, so they should not merely be discouraged; they must not be used.

Finally, it is not clear why the ISO 2022 series is discouraged.


Anyway, most of the charsets defined in RFC 1345 are not clearly specified.
Conversion tables between them and Unicode are needed.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-22 Thread Øistein E . Andersen

On 22 Oct 2009, at 17:15, NARUSE, Yui wrote:


First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets,


I am not sure what you mean; they are both listed at
http://www.iana.org/assignments/character-sets:

Name: JIS_C6226-1983 [RFC1345,KXS2]
MIBenum: 63
Source: ECMA registry
Alias: iso-ir-87
Alias: x0208
Alias: JIS_X0208-1983
Alias: csISO87JISX0208

Name: JIS_X0212-1990 [RFC1345,KXS2]
MIBenum: 98
Source: ECMA registry
Alias: x0212
Alias: iso-ir-159
Alias: csISO159JISX02121990


moreover those correct names as spec are JIS X 0208 and JIS X 0212.


(The IANA registry is internally inconsistent and often disagrees with
official standards when it comes to capitalisation, dashes/hyphens,
underscores and spaces, so it is difficult to get this right. Please
excuse me for not always paying due attention to such details in
e-mails. Of course, the specifications should follow either IANA or the
official standard as appropriate, depending on what it is referring to.)



Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not
ASCII compatible. So they are out of discouraged; mustn't use.


EBCDIC is clearly not ASCII-compatible and may be unique amongst the  
character sets in the IANA registry in providing the full ASCII  
repertoire in a different arrangement.


JIS_C6226-1983 and JIS_X0212-1990 as defined in RFC1345 (i.e., on
their own) do not contain basic ASCII characters at all, so it makes
little sense to use them for HTML documents without adding ASCII or
the ASCII-based JIS C 6220-1969, which would give something like
EUC-JP or ISO-2022-JP.  JIS_C6226-1983 contains wide versions of ASCII
characters, but those are not interpreted as HTML mark-up (unless I am
mistaken). JIS_X0212-1990 does not contain ASCII, kana or basic kanji,
so it is of extremely limited usefulness on its own even in a
plain-text setting.  Warning against completely useless encodings seems
pointless.


Many other encodings in the IANA registry are ASCII-incompatible in  
different ways; what I do not understand is what makes the ones  
currently mentioned in the HTML5 draft particularly harmful.



Finally, Why ISO 2022 series is discouraged is not clear.


We agree on this point.


Anyway, most of charsets defined RFC 1345 are not clear.
Conversion table between [those charsets and] Unicode is needed.


Quite.  Anne van Kesteren, I and several others are currently trying  
to document how browsers handle different encodings at
http://wiki.whatwg.org/wiki/Web_Encodings, and defining mappings to  
Unicode is one of the goals.  Your contribution would be much  
appreciated.


--
Øistein E. Andersen

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-22 Thread NARUSE, Yui


Øistein E. Andersen wrote:
 On 22 Oct 2009, at 17:15, NARUSE, Yui wrote:
 
 First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets,
 
 I am not sure what you mean; they are both listed at
 http://www.iana.org/assignments/character-sets:
 
 Name: JIS_C6226-1983 [RFC1345,KXS2]
 MIBenum: 63
 Source: ECMA registry
 Alias: iso-ir-87
 Alias: x0208
 Alias: JIS_X0208-1983
 Alias: csISO87JISX0208

Where is the word JIS-X-0208 ?

 Name: JIS_X0212-1990 [RFC1345,KXS2]
 MIBenum: 98
 Source: ECMA registry
 Alias: x0212
 Alias: iso-ir-159
 Alias: csISO159JISX02121990

Where is the word JIS-X-0212 ?

 moreover those correct names as spec are JIS X 0208 and JIS X 0212.
 
 Please
 excuse me for not always paying due attention to such details in
 e-mails. Of course, the specifications should follow either IANA or the
 official standard as appropriate, depending on what it is referring to.)

I did not mean you; this sentence is in the current HTML5 draft, section 4.2.5.5.
That is why I paid attention to it.

 Anyway, most of charsets defined RFC 1345 are not clear.
 Conversion table between [those charsets and] Unicode is needed.
 
 Quite.  Anne van Kesteren, I and several others are currently trying to
 document how browsers handle different encodings at
 http://wiki.whatwg.org/wiki/Web_Encodings, and defining mappings to
 Unicode is one of the goals.  Your contribution would be much appreciated.

ICU has a large set of tables which is likely to cover many MS code pages.
(Of course, they should be verified.)
http://bugs.icu-project.org/trac/browser/data/trunk/charset/data/ucm

And I have a CP51932 table made from the .NET Framework's converter.
http://nkf.sourceforge.jp/ucm/cp51932.ucm

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-22 Thread Philip Taylor
On Thu, Oct 22, 2009 at 9:23 PM, Øistein E. Andersen li...@coq.no wrote:
 On 22 Oct 2009, at 17:15, NARUSE, Yui wrote:

 Finally, Why ISO 2022 series is discouraged is not clear.

 We agree on this point.

The string "숍訊昱穿" encoded as ISO-2022-KR is the bytes 0e 3c 73 63 72
69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome, when
I last checked) will decode it as Windows-1252 and get the string
"<script>", which is bad. So a site that uses ISO-2022-KR is very
likely to expose some users to XSS attacks, which seems like a good
reason to discourage that encoding. The same applies to other ISO-2022
encodings.
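
The byte sequence above can be tried out directly. The following is a rough Python sketch under stated assumptions: Python's `cp1252` codec stands in for the fallback decoder of a user agent without ISO-2022-KR support, and Python's `iso2022_kr` codec stands in for a decoder that does support it. The `ESC $ ) C` designator and the trailing SI byte are added here because Python's codec expects them; they are not part of the bytes quoted in the message.

```python
# Bytes quoted in the message: SO (0x0E) followed by the ASCII bytes of "<script>".
payload = bytes.fromhex("0e 3c 73 63 72 69 70 74 3e")

# A user agent without ISO-2022-KR support that falls back to Windows-1252.
misread = payload.decode("cp1252")
print(repr(misread))

# A decoder that does support ISO-2022-KR. The designator ESC $ ) C selects
# KS X 1001 for the shifted state; SI (0x0F) shifts back to ASCII at the end.
intended = (b"\x1b$)C" + payload + b"\x0f").decode("iso2022_kr")
print(intended)

# A filter that inspects only the intended text finds no tag to block,
# while the fallback decoding contains one.
print("<script>" in intended, "<script>" in misread)
```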

-- 
Philip Taylor
exc...@gmail.com


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-22 Thread Øistein E . Andersen

On 22 Oct 2009, at 22:45, Philip Taylor wrote:
On Thu, Oct 22, 2009 at 9:23 PM, Øistein E. Andersen li...@coq.no wrote:

On 22 Oct 2009, at 17:15, NARUSE, Yui wrote:

Finally, Why ISO 2022 series is discouraged is not clear.

We agree on this point.

The string "숍訊昱穿" encoded as ISO-2022-KR is the bytes 0e 3c 73 63 72
69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome, when
I last checked) will decode it as Windows-1252 and get the string
"<script>", which is bad. [...]


If that is the reason, at least HZ encoding would seem to be affected  
as well.  Explicitly discouraging a more or less random subset of the  
problematic encodings without providing rationale makes it difficult  
to assess whether or not other, somewhat similar, encodings should be  
avoided as well, which was the main issue I wanted to raise.


--
Øistein E. Andersen

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-22 Thread Ian Hickson
On Wed, 21 Oct 2009, Øistein E. Andersen wrote:
 
 ASCII-compatibility:
 The note in ‘2.1.5 Character encodings’ seems to say that ‘variants of
 ISO-2022’ (presumably including common ones like ISO-2022-CN, ISO-2022-KR and
 ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot
 find anything in Section 2.1.5 that would explain this difference.

HZ-GB-2312 uses the byte ASCII uses for ~ as the escape character. 
ISO-2022-* uses the control codes. That's the difference.
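
The difference in switching mechanism can be seen in the encoded bytes. This is only an illustrative Python sketch: it assumes Python's `hz` and `iso2022_jp` codecs are representative of the two families, and the sample characters are arbitrary.

```python
hz = "中".encode("hz")               # HZ-GB-2312
iso_jp = "あ".encode("iso2022_jp")   # one ISO-2022 variant

print(hz)
print(iso_jp)

# HZ switches modes with the printable byte 0x7E ("~"), never with ESC.
print(0x7E in hz, 0x1B in hz)

# ISO-2022-JP switches modes with ESC (0x1B), a C0 control code.
print(0x1B in iso_jp)
```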


 Discouraged encodings:
 ‘4.2.5.5 Specifying the document's character encoding’ advises against
 certain encodings. In particular:
 
  Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
  (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on
  EBCDIC.
 
 It is not clear what this means (e.g., the character set JIS_C6226-1983 in
 any encoding, or only when encoded alone according to RFC1345 as described
 above); 

This is talking about character encodings, not character sets. 
JIS_C6226-1983 is a registered character encoding in the IANA registry.


 the list of discouraged encodings seems conspicuously short if it is 
 supposed to be complete; and the lack of rationale makes it difficult to 
 understand why these encodings are considered particularly harmful 
 (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two 
 at least initially puzzling cases).

The reason for including these is to discourage encodings known to have 
security issues. I've added HZ-GB-2312, which can be used in a similarly 
dangerous fashion. (Basically the danger for user agents is in an attacker 
using an encoding that a user agent could autodetect, while a site 
interprets the bytes safely; that would allow those encodings to be used 
to smuggle script elements in a way that a naive whitelisting filter 
would think is safe.)


 It might be better to say *why* particular encodings are better avoided, 
 whether or not the list of discouraged encodings be presented as 
 definitive.

I've added a note.


 (Incidentally, this advice probably deserves not to be ‘hidden’ in a 
 section nominally reserved for character encoding *declaration* issues.)

Yeah. I considered moving it to the Writing HTML documents section, but 
that one doesn't apply to conformance checkers, so it ends up being more 
of a pain, since the advice would have to be split into multiple pieces so 
that it applied appropriately. It's not a big deal.


 Minor grammar detail in 4.2.5.5:
  Conformance checkers may advise against authors using legacy encodings.
 
 This is ambiguous.  It should probably be ‘advise against authors’ using
 legacy encodings’  or better ‘advise authors against using legacy
 encodings’.

Fixed.


On Fri, 23 Oct 2009, NARUSE, Yui wrote:
 
  Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212 
  (JIS_X0212-1990), encodings based on ISO-2022, and encodings based 
  on EBCDIC.
 
 First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, moreover 
 those correct names as spec are JIS X 0208 and JIS X 0212.

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:

 I am not sure what you mean; they are both listed at
 http://www.iana.org/assignments/character-sets:
 
 Name: JIS_C6226-1983 [RFC1345,KXS2]
 MIBenum: 63
 Source: ECMA registry
 Alias: iso-ir-87
 Alias: x0208
 Alias: JIS_X0208-1983
 Alias: csISO87JISX0208
 
 Name: JIS_X0212-1990 [RFC1345,KXS2]
 MIBenum: 98
 Source: ECMA registry
 Alias: x0212
 Alias: iso-ir-159
 Alias: csISO159JISX02121990

On Fri, 23 Oct 2009, NARUSE, Yui wrote:
 
 Where is the word JIS-X-0208 ?
 Where is the word JIS-X-0212 ?

The exact string isn't there, that's why I included the preferred MIME 
names in brackets in the spec.


On Fri, 23 Oct 2009, NARUSE, Yui wrote:

 Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not
 ASCII compatible. So they are out of discouraged; mustn't use.

You can use non-ASCII-compatible encodings (e.g. UTF-16).


 Finally, Why ISO 2022 series is discouraged is not clear.

Hopefully this is clear now.


 Anyway, most of charsets defined RFC 1345 are not clear.
 Conversion table between Unicode is needed.

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
 
  moreover those correct names as spec are JIS X 0208 and JIS X 0212.
 
 (The IANA registry is internally inconsistent and often disagrees with 
 official standards when it comes to capitalisation, dashes/hyphens, 
 underscores and spaces, so it is difficult to get this right. Please 
 excuse me for not always paying due attention to such details in 
 e-mails. Of course, the specifications should follow either IANA or the 
 official standard as appropriate, depending on what it is referring to.)
 
  Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not ASCII 
  compatible. So they are out of discouraged; mustn't use.
 
 EBCDIC is clearly not ASCII-compatible and may be unique amongst the 
 character sets in the IANA 

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-21 Thread Øistein E . Andersen

On 19 Oct 2009, at 05:52, Ian Hickson wrote:

I've noted your e-mail here [...] and moved the whole thing out of  
the spec.


That does not seem to apply to the last part of the original e-mail,  
quoted below.


Øistein E. Andersen




Other character encoding issues:


ASCII-compatibility:
The note in ‘2.1.5 Character encodings’ seems to say that ‘variants  
of ISO-2022’ (presumably including common ones like ISO-2022-CN,  
ISO-2022-KR and ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312  
is not, and I cannot find anything in Section 2.1.5 that would  
explain this difference.



Discouraged encodings:
‘4.2.5.5 Specifying the document's character encoding’ advises  
against certain encodings.  (Incidentally, this advice probably  
deserves not to be ‘hidden’ in a section nominally reserved for  
character encoding *declaration* issues.)  In particular:


Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212  
(JIS_X0212-1990), encodings based on ISO-2022, and encodings based  
on EBCDIC.


It is not clear what this means (e.g., the character set  
JIS_C6226-1983 in any encoding, or only when encoded alone according  
to RFC1345 as described above); the list of discouraged encodings  
seems conspicuously short if it is supposed to be complete; and the  
lack of rationale makes it difficult to understand why these  
encodings are considered particularly harmful (JIS_C6226-1983 v.  
JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two at least  
initially puzzling cases).  It might be better to say *why*  
particular encodings are better avoided, whether or not the list of  
discouraged encodings be presented as definitive.


Minor grammar detail in 4.2.5.5:
Conformance checkers may advise against authors using legacy  
encodings.


This is ambiguous.  It should probably be ‘advise against authors’  
using legacy encodings’  or better ‘advise authors against using  
legacy encodings’.


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-10-18 Thread Ian Hickson
On Sat, 18 Jul 2009, Øistein E. Andersen wrote:
 On 7 Jul 2009, at 09:25, Ian Hickson wrote:
  On Tue, 9 Jun 2009, Anne van Kesteren wrote:
   [S]hould HTML5 mention that Windows-932 maps to Windows-31J? (It does
   not appear in the IANA registry.)
  
  I've added this mapping too, just in case.
 
  Added x-sjis. What are the other mappings that would be good?
 
 Potentially quite a few...  The following do not appear in the IANA registry
 and seem to be supported in IE as well as in at least two of the three
 browsers Safari, Firefox and Opera. [...]

I've noted your e-mail here:

   http://wiki.whatwg.org/wiki/Web_Encodings#E-mails

...and moved the whole thing out of the spec. I think the conclusion is 
that we should just do this using IANA aliases.


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-07-17 Thread Øistein E . Andersen

On 7 Jul 2009, at 09:25, Ian Hickson wrote:


On Tue, 9 Jun 2009, Anne van Kesteren wrote:

[S]hould HTML5 mention that Windows-932 maps to Windows-31J? (It does
not appear in the IANA registry.)


I've added this mapping too, just in case.



Added x-sjis. What are the other mappings that would be good?


Potentially quite a few...  The following do not appear in the IANA  
registry and seem to be supported in IE as well as in at least two of  
the three browsers Safari, Firefox and Opera.


Aliases for EUC-CN or GB2312-80, ultimately mapping to GBK:
- EUC-CN
- x-euc-cn
- CN-GB
- csGB231280

Alias for EUC-JP:
- X-EUC-JP

Aliases for Big5:
- cn-big5
- x-x-big5 (already in HTML5)

Aliases for Shift_JIS or Windows-31J (which was originally called  
Shift_JIS):

- x-sjis (already in HTML5)

Alias for windows-1256:
- cp1256

Name and alias for windows-874 (which does not seem to appear in the  
IANA registry):

- windows-874
- DOS-874

In addition, the following legacy Macintosh encodings enjoy universal  
support (IE, Safari, Firefox, Opera), but do not appear in the IANA  
registry:

- x-mac-icelandic
- x-mac-arabic (somewhat incomplete implementation in IE)
- x-mac-ce (Central-European)
- x-mac-croatian
- x-mac-romanian
- x-mac-cyrillic
- x-mac-ukrainian
- x-mac-greek
- x-mac-turkish

Windows-932 is not supported in IE7 and may not be necessary; others  
should probably be added if windows-932 is deemed necessary.
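
As a rough illustration, the alias lists above could be held in a simple lookup table. This is a Python sketch only: the names `LABEL_TO_ENCODING` and `resolve_label` are hypothetical, case-insensitive matching is an assumption, and `x-sjis` is mapped to Shift_JIS although the list allows either Shift_JIS or Windows-31J.

```python
# Labels from the lists above, keyed in lower case, mapped to the encoding
# each is said to alias.
LABEL_TO_ENCODING = {
    "euc-cn": "GBK",
    "x-euc-cn": "GBK",
    "cn-gb": "GBK",
    "csgb231280": "GBK",
    "x-euc-jp": "EUC-JP",
    "cn-big5": "Big5",
    "x-x-big5": "Big5",
    "x-sjis": "Shift_JIS",      # assumption: could equally be Windows-31J
    "cp1256": "windows-1256",
    "windows-874": "windows-874",
    "dos-874": "windows-874",
}


def resolve_label(label):
    """Return the encoding for a label, or None if the label is not listed."""
    return LABEL_TO_ENCODING.get(label.strip().lower())


print(resolve_label("X-EUC-JP"))
print(resolve_label("DOS-874"))
print(resolve_label("x-unknown"))
```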




I've split the table in two to avoid this issue.


It looks much better now.  (The terminology is perhaps slightly  
inconsistent, but that can be fixed later.)




Earlier, you wrote:


GB2312 and GB_2312-80 technically refer to the *character set* GB
2312-80, [...]. GBK, on the other hand, is an encoding.


As far as I can tell, GB2312 and GB_2312-80 are two different encodings
according to IANA.


Indeed.

The following CJK character sets are listed as encodings in the IANA  
registry:

- JIS_C6226-1978
- JIS_C6226-1983
- JIS_X0212-1990
- GB_2312-80
- KS_C_5601-1987

All these character sets are defined as a 94x94 matrix with rows and  
columns numbered from 1 to 94 (inclusive). According to RFC1345, a  
character is to be encoded as the two-byte sequence (row number + 32),  
(column number + 32) in the eponymous encoding. (The two-byte  
sequences are thus the same as in an ISO-2022 encoding, but only one  
character set is available, and there are no escape sequences or  
anything remotely similar.)
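
The row/column arithmetic described above is small enough to write down. The sketch below is in Python, with the hypothetical function names `rfc1345_bytes` and `euc_bytes`; the EUC variant (the same two bytes with the high bit set) is added for comparison and is not taken from the paragraph above.

```python
def rfc1345_bytes(row, column):
    """Encode a 94x94 cell as (row + 32), (column + 32), as described above."""
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("row and column must be in the range 1..94")
    return bytes([row + 32, column + 32])


def euc_bytes(row, column):
    """Same cell in the EUC style: the two bytes with the high bit set."""
    return bytes(b | 0x80 for b in rfc1345_bytes(row, column))


print(rfc1345_bytes(1, 1))     # first cell of the matrix
print(rfc1345_bytes(94, 94))   # last cell of the matrix
print(rfc1345_bytes(28, 83))   # a cell whose bytes are also printable ASCII
print(euc_bytes(1, 1))
```

Every two-byte sequence produced by `rfc1345_bytes` lies in 0x21..0x7E, so it is indistinguishable from a pair of printable ASCII bytes without knowledge of the encoding.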


In addition, GB_2312, which is really GB_2312-80 with the year  
omitted, has been defined as what is properly known as EUC-CN.


JIS_C6226-1978, JIS_C6226-1983 and JIS_X0212-1990 do not seem to be  
supported in browsers at all.  Both GB_2312-80 and GB_2312 are taken  
to mean GBK, which is a superset of EUC-CN.  KS_C_5601-1987 is taken  
to mean windows-949, a superset of EUC-KR, in Safari, Firefox and  
Opera (IE treats it as the union of windows-949 and ISO-2022-KR, which  
may or may not be needed for compatibility).


This is all quite confusing, and what is called GB_2312 in IANA really  
should be renamed to EUC-CN (keeping GB_2312 as an alias).  The HTML5  
tables are now technically correct (provided that the encoding names  
be interpreted strictly according to the IANA registry).


Very minor detail:  The capitalisation of Windows/windows is  
inconsistent in the IANA registry; you would have to write, e.g.,  
windows-932 and Windows-31J  to follow IANA.



Other character encoding issues:


ASCII-compatibility:
The note in ‘2.1.5 Character encodings’ seems to say that ‘variants of  
ISO-2022’ (presumably including common ones like ISO-2022-CN,  
ISO-2022-KR and ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312  
is not, and I cannot find anything in Section 2.1.5 that would explain  
this difference.



Discouraged encodings:
‘4.2.5.5 Specifying the document's character encoding’ advises against  
certain encodings.  (Incidentally, this advice probably deserves not  
to be ‘hidden’ in a section nominally reserved for character encoding  
*declaration* issues.)  In particular:


Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212  
(JIS_X0212-1990), encodings based on ISO-2022, and encodings based  
on EBCDIC.


It is not clear what this means (e.g., the character set  
JIS_C6226-1983 in any encoding, or only when encoded alone according  
to RFC1345 as described above); the list of discouraged encodings  
seems conspicuously short if it is supposed to be complete; and the  
lack of rationale makes it difficult to understand why these encodings  
are considered particularly harmful (JIS_C6226-1983 v. JIS_C6226-1978  
or ISO-2022 v. HZ, to mention but two at least initially puzzling  
cases).  It might be better to say *why* particular encodings are  
better avoided, whether or not the list of discouraged encodings be  
presented as definitive.


Minor grammar detail in 4.2.5.5:
Conformance checkers may advise against authors 

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-07-07 Thread Ian Hickson
On Tue, 9 Jun 2009, Anne van Kesteren wrote:
 On Tue, 09 Jun 2009 01:42:57 +0200, Øistein E. Andersen li...@coq.no wrote:
  On 5 June 09, Anne van Kesteren wrote:
 
  Is the implication here that Shift_JIS and Shift-JIS are distinct 
  [...]?
 
  No, Shift-JIS and Windows-932 are commonly used names/labels for the 
  encodings that are registered as Shift_JIS and Windows-31J 
  (respectively) in the IANA charset registry. Sorry for the confusion 
  caused.
 
 So should HTML5 mention that Windows-932 maps to Windows-31J? (It does 
 not appear in the IANA registry.)

I've added this mapping too, just in case.


On Tue, 9 Jun 2009, Øistein E. Andersen wrote:
 
 That is an interesting question. My (apparently wrong) understanding was 
 that the table was merely supposed to provide mappings between 
 encodings, since such mappings are inappropriate in non-HTML contexts 
 and cannot be added to the IANA registry. It might be useful to 
 include a set of MIME charset strings which cannot be or have not yet 
 been registered (e.g., x-x-big5, x-sjis, windows-932) as well as 
 information on how CJK character sets are implemented in practice, both 
 of which seem to be necessary for compatibility.
 
 Such information does not fit comfortably in the current table, though.

Added x-sjis. What are the other mappings that would be good?


On Tue, 9 Jun 2009, Øistein E. Andersen wrote:
  
  I believe you misunderstand the purpose of this table. The idea is to 
  give a mapping of _labels_ to encodings, not encodings to encodings. 
  I've clarified the text to this effect.
 
 You seem to have added "specified by a label" to the phrase which now 
 reads "an encoding specified by a label given in the first column of the 
 following table" without changing the column heading ("Input encoding") 
 and without defining what a "label" actually is. The reference to 
 "encoding aliasing" is also intact, which seems misleading if the table 
 is not supposed to map between encodings.

I've split the table in two to avoid this issue.


Earlier, you wrote:

 GB2312 and GB_2312-80 technically refer to the *character set* GB 
 2312-80, [...]. GBK, on the other hand, is an encoding.

As far as I can tell, GB2312 and GB_2312-80 are two different encodings 
according to IANA.


On Wed, 10 Jun 2009, Anne van Kesteren wrote:
 
 I would prefer them being added to the IANA registry.

I've noted that I should do that.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-06-11 Thread Øistein E . Andersen

On 10 June 09 at 09:06, Anne van Kesteren wrote:

It is about adding aliases. If the alias added is also a distinct  
encoding conformance checkers are supposed to report on the  
differences.


That probably has to be made more explicit, then.

Personally I would be happy with making the aliases normative
everywhere but I suspect that is not going to fly. E.g. letting
US-ASCII always map to Windows-1252 would probably be highly
controversial.


That particular mapping may not actually be necessary (IE8 maps 8-bit  
US-ASCII to U+FFFD, and several previous versions of IE ignore the  
high bit), so making the other aliases normative still seems worth  
considering. There are a few aliases whose name starts with x-, though.



I would prefer them being added to the IANA registry.


Sure.


It might be useful to
include a set of MIME charset strings which cannot be or have not yet
been registered (e.g., x-x-big5, x-sjis, windows-932) as well as
information on how CJK character sets are implemented in practice,
both of which seem to be necessary for compatibility.


Such information should definitely be included, yes.


In that case, it would probably be less confusing and more accurate to  
have one table mapping between encodings (or from preferred MIME name  
to encoding or something along those lines) and another table adding  
additional MIME charset strings.


Since you seem to have studied this subject a lot, do you keep more  
detailed information somewhere including tests, findings, tables,  
etc? It would be very cool to have that.


Most of the relevant findings have been sent to the WhatWG list as  
part of the current thread. The following messages contain links to  
tables and tests:


http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-March/014190.html 

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-July/015455.html 

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-April/019322.html 



Some of the tables and tests may be difficult to interpret, so please  
feel free to ask if you have any questions.


--
Øistein E. Andersen

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-06-09 Thread Øistein E . Andersen

On 3 June 09 at 23:19, Ian Hickson wrote:


On Tue, 14 Apr 2009, Øistein E. Andersen wrote:


HTML5 currently contains a table of encodings aliases,
[...]
GB2312 and GB_2312-80 technically refer to the *character set* GB
2312-80, [...]. GBK, on the other hand, is an encoding.
[...]
There is a large number of unregistered charset strings, however, and
the other mappings in this table are between encodings.  Unless
x-x-big5 is actually supposed to refer to an encoding distinct from
Big5, [this mapping] should be removed.
[...]


I believe you misunderstand the purpose of this table. The idea is to
give a mapping of _labels_ to encodings, not encodings to encodings.
I've clarified the text to this effect.


You seem to have added "specified by a label" to the phrase which now  
reads "an encoding specified by a label given in the first column of  
the following table" without changing the column heading ("Input  
encoding") and without defining what a "label" actually is. The  
reference to "encoding aliasing" is also intact, which seems  
misleading if the table is not supposed to map between encodings.


The concept of "misinterpret[ation] for compatibility" seems  
inappropriate for the mapping from x-x-big5 to Big5 unless the label  
x-x-big5 is actually supposed to specify an encoding distinct from Big5.


It is not at all clear to me what you mean by "label". It might be the  
MIME charset string with which the HTML document is labelled, but that  
would require an inordinate number of strings to be specified (e.g.,  
iso-ir-100, latin1 and IBM819 amongst others alongside ISO-8859-1), so  
this cannot possibly be the intended meaning. It might be a normalised  
form of the MIME charset string, using the IANA charset registry to  
map an alias to its corresponding name (or to the alias  
qualified as preferred MIME name if there is such an entry), but  
that does not quite seem to work either, since aliases not registered  
in the IANA charset registry would then not be covered by the aliasing  
mechanism (e.g., it would cause content labelled as x-sjis to be  
handled as unaugmented Shift_JIS despite the mapping from Shift_JIS to  
Windows-31J, since x-sjis does not and cannot figure in the IANA  
charset registry).


I did indeed believe that the table was supposed to map between  
encodings, and this interpretation still seems to give the correct  
result in practice for non-CJK encodings (unless, of course, content  
labelled TIS-620-2533 should actually be interpreted as TIS-620 rather  
than windows-874).



On 9 June 09 at 10:55, Anne van Kesteren wrote:


On Tue, 09 Jun 2009 01:42:57 +0200, Øistein E. Andersen wrote:


Shift-JIS and Windows-932 are commonly used names/labels for the
encodings that are registered as Shift_JIS and Windows-31J
(respectively) in the IANA charset registry. [...]


So should HTML5 mention that Windows-932 maps to Windows-31J? (It  
does not appear in the IANA registry.)



That is an interesting question. My (apparently wrong) understanding  
was that the table was merely supposed to provide mappings between  
encodings, since such mappings are inappropriate in non-HTML contexts  
and cannot be added to the IANA registry. It might be useful to  
include a set of MIME charset strings which cannot be or have not yet  
been registered (e.g., x-x-big5, x-sjis, windows-932) as well as  
information on how CJK character sets are implemented in practice,  
both of which seem to be necessary for compatibility.


Such information does not fit comfortably in the current table, though.


--
Øistein E. Andersen

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-06-08 Thread Øistein E . Andersen

On Tue, 14 Apr 2009, Øistein E. Andersen wrote:


Shift_JIS ⊂ Windows-31J
[...]
Shift-JIS ⊂ Windows-932



On 5 June 09, Anne van Kesteren wrote:

Is the implication here that Shift_JIS and Shift-JIS are distinct  
[...]?



No, Shift-JIS and Windows-932 are commonly used names/labels for the  
encodings that are registered as Shift_JIS and Windows-31J  
(respectively) in the IANA charset registry. Sorry for the confusion  
caused.


--
Øistein E. Andersen

PS: Sorry for the belated reply, partly caused by a hard-drive
breakdown while I was away.

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-06-05 Thread Anne van Kesteren
Is the implication here that Shift_JIS and Shift-JIS are distinct despite the 
encoding matching rules in Unicode not allowing for that? If that is the case I 
think we need new matching rules.

If the implication is something else I'd like to know.
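
[Illustration. The message does not say which Unicode matching rules are meant; the sketch below assumes they are the loose charset-alias matching of Unicode Technical Standard #22 (drop everything except ASCII letters and digits, ignore case, drop zeros that do not follow a digit). The function name `loose_key` is hypothetical. Under that assumption, the two spellings cannot name distinct encodings:]

```python
import re


def loose_key(label: str) -> str:
    """Reduce a charset label to a key for loose comparison (assumed UTS #22 style)."""
    # Keep only ASCII letters and digits, then fold case.
    key = re.sub(r"[^A-Za-z0-9]", "", label).lower()
    # Drop runs of zeros that are not preceded by a digit ("utf-08" -> "utf8").
    return re.sub(r"(?<![0-9])0+", "", key)


# Underscore and hyphen both disappear, so the two labels collapse to one key.
print(loose_key("Shift_JIS"), loose_key("Shift-JIS"))
print(loose_key("Shift_JIS") == loose_key("Shift-JIS"))
```

With matching of this kind, Shift_JIS and Shift-JIS are one label, so any table that treats them differently would indeed need different matching rules.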


On Thu, 04 Jun 2009 00:19:05 +0200, Ian Hickson i...@hixie.ch wrote:
 On Tue, 14 Apr 2009, Øistein E. Andersen wrote:
 [...]

 In addition, Shift_JIS ⊂ Windows-31J, and all browsers implement this
 mapping, so the following should be added:
    Shift_JIS  ->  Windows-31J

 Added.


 [...]

 Shift-JIS encoding for Japanese
 ===

 Shift-JIS supports:
 - ASCII
 - Katakana
 - JIS X 0208-1990/1997

 All browsers furthermore support NEC symbols as well as IBM extensions in
 both NEC and IBM (Shift-JIS) positions.  This is actually Windows-932:

 Shift-JIS ⊂ Windows-932

 [...]


-- 
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-06-05 Thread Ian Hickson
On Fri, 5 Jun 2009, Anne van Kesteren wrote:

 Is the implication here that Shift_JIS and Shift-JIS are distinct 
 despite the encoding matching rules in Unicode not allowing for that? If 
 that is the case I think we need new matching rules.
 
 If the implication is something else I'd like to know.

I don't understand the question.



Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-06-05 Thread Ian Hickson
On Fri, 5 Jun 2009, Anne van Kesteren wrote:
 On Fri, 05 Jun 2009 10:14:46 +0200, Ian Hickson i...@hixie.ch wrote:
  On Fri, 5 Jun 2009, Anne van Kesteren wrote:
 
  Is the implication here that Shift_JIS and Shift-JIS are distinct
  despite the encoding matching rules in Unicode not allowing for that? If
  that is the case I think we need new matching rules.
 
  If the implication is something else I'd like to know.
 
  I don't understand the question.
 
 Part of my email was the data that Shift_JIS supposedly is a subset of 
 Windows-31J and Shift-JIS supposedly is a subset of Windows-932. (Note 
 the dash versus underscore.)

Ah, ok. I thought you were referring to the change I made to the spec. My 
apologies.



Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-06-03 Thread Ian Hickson

I haven't made any changes to the spec based on the feedback below. Let me 
know if there's anything I missed. I'm not aware of any specific problems 
at this time.

On Sat, 11 Apr 2009, Øistein E. Andersen wrote:

 On 22 May 2008, at 12:40, Ian Hickson wrote:
 
  Do you have input on the EUC-JP issue?
 
 I am now about to finish my analysis of CJK encodings (e-mail forthcoming),
 including EUC-JP.  This encoding does not seem to be particularly problematic,
 however.  Are you referring to a specific problem?
 
  On Thu, 13 Mar 2008, Øistein E. Andersen wrote:
   Note: Similarly, IE apparently handles CS-ISO-2022-JP as distinct from
   ISO-2022-JP. This is something to keep in mind when looking at
   multi-byte encodings.
  
  What should we say about this?
 
 The issue seems to be that IE's implementation of ISO-2022-JP is a large
 superset of what is actually specified.  (This is the case for several other
 CJK encodings as well.)  See forthcoming e-mail for an actual description of
 the extensions.
 
   (TC)VN5712-2 ⊂ (TC)VN5712-1
   
   Opera[?] and Firefox seem to have implemented the superset only.
  
  Should we require this mapping?
 
 For reference:
 (TC)VN5712-3 ⊂ (TC)VN5712-2 = VSCII-2 = ISO IR 180 ⊂ (TC)VN5712-1
 
 Only the complete set seems to be implemented (and only in Firefox), and MIME
 charset strings referring to one of the subsets do not seem to work at all, so
 no mappings are necessary.
 
 


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-06-03 Thread Ian Hickson
On Sun, 12 Apr 2009, Øistein E. Andersen wrote:
 On 2 Sep 2008, at 06:06, Ian Hickson wrote:
 
  On Wed, 30 Jul 2008, Øistein E. Andersen wrote:
   
   1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
  IE7, on the other hand, simply ignores the high bit (as it does for
  a few other 7-bit encodings, by the way).  Perhaps this
  alias could be dropped from the other browsers.
  
  Ignoring the high bit seems like a dangerous security bug; dropping any
  character with a high bit as U+FFFD seems unnecessarily drastic.
 
 According to a test I did using browsershots.org, IE8 actually seems to do
 this (8-bit characters are rendered as squares), which looks like an argument
 in favour of the more `drastic' option.
 
  I've made the spec go with the O/F/S behaviour here.
 
 This has the advantage of not adding ASCII as a separate encoding, and
 Windows-1252 is presumably one of the encodings most often mislabelled as
 ASCII.  However, IE has ignored the high bit at least since 5.01 (IE4 via
 browsershots.org treats it as CP1252, but this could well be
 locale-dependent), so there may not be that many mislabelled pages.  Has
 anyone got a list of pages which are labelled as ASCII and contain 8-bit
 characters?
 
 This is probably not very important.  U+FFFD is `purer', Windows-1252 has the
 potential of rescuing a few pages.  It is however essential that 8-bit
 characters be considered not conforming since they do not in fact work (as
 Windows-1252 bytes) in IE5-IE8.  This is currently the case, but I think Henri
 Sivonen has argued that `misinterpretation for compatibility' should not be
 considered a conformance error (which would probably be fairly harmless for
 other mappings).

I (and the spec) agree with you here, that these should be reported as 
errors.


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-06-03 Thread Ian Hickson
On Tue, 14 Apr 2009, Øistein E. Andersen wrote:

 This e-mail is an attempt to give a relatively concise yet reasonably complete
 overview of non-Unicode character sets and encodings for `Chinese characters',
 excluding those which are not supported by at least one of the four browsers
 IE, Safari, Firefox and Opera (henceforth `all browsers'), and tentatively
 avoiding technical details which are out of scope for HTML5 unless they are
 important to gain a general understanding of the relevant issues.
 
 To avoid unnecessary confusion, the following three concepts are kept
 distinct:
 
 1) Character set: A collection of characters, typically defined as a matrix
 with 94 rows and 94 columns.  (A character set with more than one matrix is
 said to have multiple planes.)  The ones officially registered `for use with
 escape sequences' (typically in ISO-2022 encodings, see below) can be found at
 http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm.
 
 2) Encoding: Defines how a given character (typically defined by its row and
 column numbers) from a given character set can be encoded as a sequence of
 bytes.  All the encodings discussed below allow multiple character sets to be
 encoded.  [ISO-2022 encodings use only 7-bit bytes and employ escape sequences
 to switch between different character sets. EUC encodings use bytes < 128 for
 ASCII (or something similar) and bytes >= 128 to encode other character sets.]
 
 3) MIME charset string: This is the string used, e.g., in a HTTP Content-Type
 header to indicate the *encoding*.  Many of these can be found at
 http://www.iana.org/assignments/character-sets.
 
 Some information about browser support for specific character sets, encodings
 and MIME charset strings can be found at
 http://coq.no/character-tables/mime/iso-2022/en,
 http://coq.no/character-tables/mime/euc/en and
 http://coq.no/character-tables/mime/locale-specific/en.
 
 The notation a ⊂ b means that a is a proper subset of b; a and b can be either
 character sets or encodings.
 
 
 ******************************************
 * What should HTML 5 say about all this? *
 ******************************************
 
 This section gives a summary of superset encodings which are either
 universally supported or potentially needed for compatibility.
 
 (Anyone who is going to read the entire e-mail will probably prefer to read
 the sections *Chinese*, *Japanese* and *Korean* at this point and return to
 this section afterwards.)
 
 
 Superset encodings (stricto sensu)
 ----------------------------------
 
 HTML5 currently contains a table of encodings aliases, of which the following
 involve Chinese characters:
 
 1) EUC-KR          ->  Windows-949
 2) GB2312          ->  GBK
 3) GB_2312-80      ->  GBK
 4) KS_C_5601-1987  ->  Windows-949
 5) x-x-big5        ->  Big5
 
 EUC-KR ⊂ Windows-949, and all browsers do 1), so this is reasonable and
 probably needed.
 
 GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80,
 which can be expressed not only in EUC-CN encoding, but also in ISO-2022-CN
 encoding and HZ encoding.  GBK, on the other hand, is an encoding.  EUC-CN ⊂
 GBK.  It would be more correct to remove 2) and 3) and instead add:
    EUC-CN  ->  GBK
 
 Admittedly, EUC-CN is sometimes called `8-bit GB encoding', and registered
 MIME charset strings include GB2312 and GB_2312-80 as distinct entries
 (but not EUC-CN), so a note to this effect might be appropriate.
 
 (Additionally, GBK is slightly ambiguous, so make sure not to reference an
 incomplete or outdated version without pointing out necessary
 amendments/additions.)
 
 Similarly, EUC-KR is sometimes referred to as `eight-bit KS' or
 `KS_C_5601-1987', which Ken Lunde characterises as `incorrect and dangerous'
 in his book /CJKV Information Processing/.  It would be more correct to remove
 4).
 
 Unlike EUC-CN, EUC-KR is a registered MIME charset string, but KS_C_5601-1987
 has a distinct entry, so a note might again be appropriate.
 
 As for 5), the MIME charset string x-x-big5 does indeed correspond to Big5
 encoding (or rather an extension thereof) in all browsers but Opera.  There is
 a large number of unregistered charset strings, however, and the other
 mappings in this table are between encodings.  Unless x-x-big5 is actually
 supposed to refer to an encoding distinct from Big5, 5) should be removed.
 
 Instead (depending on the reference ultimately given for Big5), it may be
 necessary to note that at least certain ETen extensions should be regarded as
 part of Big5.

I believe you misunderstand the purpose of this table. The idea is to give 
a mapping of _labels_ to encodings, not encodings to encodings. I've 
clarified the text to this effect.
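
[Illustration of the label/encoding distinction as a minimal Python sketch. The two dictionaries and the function `resolve` are hypothetical; the entries are only the ones mentioned in this thread and are not copied from the spec's tables.]

```python
# Labels seen in the wild that are not (or cannot be) IANA-registered,
# mapped to the registered name they stand for.
EXTRA_LABELS = {
    "x-sjis": "shift_jis",
    "shift-jis": "shift_jis",
    "windows-932": "windows-31j",
    "x-x-big5": "big5",
}

# Encodings that are handled as a superset ("misinterpreted for compatibility").
SUPERSET_OVERRIDES = {
    "shift_jis": "windows-31j",
    "euc-kr": "windows-949",
    "ks_c_5601-1987": "windows-949",
    "gb2312": "gbk",
    "gb_2312-80": "gbk",
    "us-ascii": "windows-1252",
}


def resolve(label: str) -> str:
    """Map a charset label to the encoding actually used to decode."""
    name = label.strip().lower()
    # Step 1: label -> encoding.
    name = EXTRA_LABELS.get(name, name)
    # Step 2: encoding -> superset encoding, where one is defined.
    return SUPERSET_OVERRIDES.get(name, name)


for label in ("x-sjis", "Shift_JIS", "Windows-932", "UTF-8"):
    print(label, "->", resolve(label))
```

Keeping the two steps in separate tables means that an unregistered label such as x-sjis still ends up at Windows-31J rather than at unaugmented Shift_JIS.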



 In addition, Shift_JIS ⊂ Windows-31J, and all browsers implement this mapping,
 so the following should be added:
    Shift_JIS  ->  Windows-31J

Added.


I haven't added the mappings described below, since they are not all 
implemented uniformly. If specific mappings are 

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-04-13 Thread Øistein E . Andersen
This e-mail is an attempt to give a relatively concise yet reasonably  
complete overview of non-Unicode character sets and encodings for  
`Chinese characters', excluding those which are not supported by at  
least one of the four browsers IE, Safari, Firefox and Opera  
(henceforth `all browsers'), and tentatively avoiding technical  
details which are out of scope for HTML5 unless they are important to  
gain a general understanding of the relevant issues.


To avoid unnecessary confusion, the following three concepts are kept  
distinct:


1) Character set: A collection of characters, typically defined as a  
matrix with 94 rows and 94 columns.  (A character set with more than  
one matrix is said to have multiple planes.)  The ones officially  
registered `for use with escape sequences' (typically in ISO-2022  
encodings, see below) can be found at
http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm.


2) Encoding: Defines how a given character (typically defined by its  
row and column numbers) from a given character set can be encoded as a  
sequence of bytes.  All the encodings discussed below allow multiple  
character sets to be encoded.  [ISO-2022 encodings use only 7-bit  
bytes and employ escape sequences to switch between different  
character sets. EUC encodings use bytes < 128 for ASCII (or something  
similar) and bytes >= 128 to encode other character sets.]


3) MIME charset string: This is the string used, e.g., in a HTTP  
Content-Type header to indicate the *encoding*.  Many of these can be  
found at http://www.iana.org/assignments/character-sets.
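
[Illustration of definition 2) as a minimal Python sketch: the same row/column position in one character set yields different bytes under an EUC encoding and under an ISO-2022 encoding. The helper names are hypothetical. The sample character, its position (JIS X 0208 row 4, column 2) and the use of Python's `euc_jp` and `iso2022_jp` codecs as reference implementations are assumptions of this example, not part of the original message.]

```python
def euc_bytes(row: int, col: int) -> bytes:
    """EUC form: row and column (1..94) offset by 0xA0, so both bytes are >= 128."""
    return bytes([0xA0 + row, 0xA0 + col])


def iso2022_bytes(row: int, col: int) -> bytes:
    """ISO-2022 form: row and column offset by 0x20, so both bytes stay 7-bit."""
    return bytes([0x20 + row, 0x20 + col])


# Escape sequences used by ISO-2022-JP to switch character sets.
TO_JIS_X_0208 = b"\x1b$B"
TO_ASCII = b"\x1b(B"

ch = "\u3042"  # HIRAGANA LETTER A, assumed to sit at row 4, column 2

print(euc_bytes(4, 2) == ch.encode("euc_jp"))
print(TO_JIS_X_0208 + iso2022_bytes(4, 2) + TO_ASCII == ch.encode("iso2022_jp"))
```

The character set supplies the position; the encoding decides whether that position is sent as two high-bit bytes or as two 7-bit bytes wrapped in escape sequences.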


Some information about browser support for specific character sets,  
encodings and MIME charset strings can be found at
http://coq.no/character-tables/mime/iso-2022/en,
http://coq.no/character-tables/mime/euc/en and
http://coq.no/character-tables/mime/locale-specific/en.


The notation a ⊂ b means that a is a proper subset of b; a and b can  
be either character sets or encodings.



******************************************
* What should HTML 5 say about all this? *
******************************************

This section gives a summary of superset encodings which are either  
universally supported or potentially needed for compatibility.


(Anyone who is going to read the entire e-mail will probably prefer to  
read the sections *Chinese*, *Japanese* and *Korean* at this point and  
return to this section afterwards.)



Superset encodings (stricto sensu)
----------------------------------

HTML5 currently contains a table of encodings aliases, of which the  
following involve Chinese characters:


1) EUC-KR          ->  Windows-949
2) GB2312          ->  GBK
3) GB_2312-80      ->  GBK
4) KS_C_5601-1987  ->  Windows-949
5) x-x-big5        ->  Big5

EUC-KR ⊂ Windows-949, and all browsers do 1), so this is reasonable  
and probably needed.


GB2312 and GB_2312-80 technically refer to the *character set* GB  
2312-80, which can be expressed not only in EUC-CN encoding, but also  
in ISO-2022-CN encoding and HZ encoding.  GBK, on the other hand, is  
an encoding.  EUC-CN ⊂ GBK.  It would be more correct to remove 2) and  
3) and instead add:

   EUC-CN  ->  GBK

Admittedly, EUC-CN is sometimes called `8-bit GB encoding', and  
registered MIME charset strings include GB2312 and GB_2312-80 as  
distinct entries (but not EUC-CN), so a note to this effect might be  
appropriate.


(Additionally, GBK is slightly ambiguous, so make sure not to  
reference an incomplete or outdated version without pointing out  
necessary amendments/additions.)


Similarly, EUC-KR is sometimes referred to as `eight-bit KS' or  
`KS_C_5601-1987', which Ken Lunde characterises as `incorrect and  
dangerous' in his book /CJKV Information Processing/.  It would be  
more correct to remove 4).


Unlike EUC-CN, EUC-KR is a registered MIME charset string, but  
KS_C_5601-1987 has a distinct entry, so a note might again be  
appropriate.


As for 5), the MIME charset string x-x-big5 does indeed correspond to  
Big5 encoding (or rather an extension thereof) in all browsers but  
Opera.  There is a large number of unregistered charset strings,  
however, and the other mappings in this table are between encodings.   
Unless x-x-big5 is actually supposed to refer to an encoding distinct  
from Big5, 5) should be removed.


Instead (depending on the reference ultimately given for Big5), it may  
be necessary to note that at least certain ETen extensions should be  
regarded as part of Big5.


In addition, Shift_JIS ⊂ Windows-31J, and all browsers implement this  
mapping, so the following should be added:

   Shift_JIS  ->  Windows-31J
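
[Illustration of why the superset mapping matters, as a minimal Python sketch. It assumes that Python's `shift_jis` codec approximates unaugmented Shift_JIS and that its `cp932` codec approximates Windows-31J; the byte pair 0x87 0x40 is taken as an example of an NEC extension character.]

```python
nec_extension = b"\x87\x40"  # NEC row 13, outside JIS X 0208 proper

# The superset decoder accepts the extension character.
as_windows_31j = nec_extension.decode("cp932")

# The strict decoder rejects the same bytes.
try:
    nec_extension.decode("shift_jis")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

# Text that stays within JIS X 0208 decodes identically under both.
plain = "\u65e5\u672c\u8a9e".encode("shift_jis")
same_for_plain_text = plain.decode("shift_jis") == plain.decode("cp932")

print(repr(as_windows_31j), strict_accepts, same_for_plain_text)
```

A page labelled Shift_JIS that contains such an extension character is only readable if the label is handled as Windows-31J.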


Further superset encodings (probably not needed)


ISO-2022-CN ⊂ ISO-2022-CN-EXT

This is reasonable, but probably not necessary: Firefox does it,  
Safari does not, Opera does not implement the superset, IE does not  
even implement the subset.  Distinguishing between them is pointless.




Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-04-12 Thread Øistein E . Andersen

On 2 Sep 2008, at 06:06, Ian Hickson wrote:


On Wed, 30 Jul 2008, Øistein E. Andersen wrote:


1. Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
   IE7, on the other hand, simply ignores the high bit (as it does for
   a few other 7-bit encodings, by the way).  Perhaps this
   alias could be dropped from the other browsers.


Ignoring the high bit seems like a dangerous security bug; dropping any
character with a high bit as U+FFFD seems unnecessarily drastic.


According to a test I did using browsershots.org, IE8 actually seems  
to do this (8-bit characters are rendered as squares), which looks  
like an argument in favour of the more `drastic' option.



I've made the spec go with the O/F/S behaviour here.


This has the advantage of not adding ASCII as a separate encoding, and  
Windows-1252 is presumably one of the encodings most often mislabelled  
as ASCII.  However, IE has ignored the high bit at least since 5.01  
(IE4 via browsershots.org treats it as CP1252, but this could well be  
locale-dependent), so there may not be that many mislabelled pages.   
Has anyone got a list of pages which are labelled as ASCII and contain  
8-bit characters?


This is probably not very important.  U+FFFD is `purer', Windows-1252  
has the potential of rescuing a few pages.  It is however essential  
that 8-bit characters be considered not conforming since they do not  
in fact work (as Windows-1252 bytes) in IE5-IE8.  This is currently  
the case, but I think Henri Sivonen has argued that `misinterpretation  
for compatibility' should not be considered a conformance error (which  
would probably be fairly harmless for other mappings).
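
[Illustration of the three treatments discussed above for 8-bit bytes in content labelled US-ASCII, as a minimal Python sketch. The sample bytes are made up for the example, and the remark about markup is an inference about why ignoring the high bit was called dangerous, not something stated in the thread.]

```python
data = b"caf\xe9"  # labelled US-ASCII but contains one 8-bit byte

# Opera/Firefox/Safari behaviour: treat the content as Windows-1252.
as_windows_1252 = data.decode("cp1252")                            # "café"

# IE5 to IE7 behaviour: ignore the high bit of every byte.
high_bit_ignored = bytes(b & 0x7F for b in data).decode("ascii")   # "cafi"

# IE8 behaviour (the "purer" option): replace each 8-bit byte by U+FFFD.
as_replacement = data.decode("ascii", errors="replace")            # "caf\ufffd"

# Stripping the high bit can turn a harmless byte into markup:
# 0xBC becomes 0x3C, i.e. "<".
stripped_to_markup = bytes([0xBC & 0x7F])

print(as_windows_1252, high_bit_ignored, repr(as_replacement), stripped_to_markup)
```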



4. Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite  
inconsistently; [...]




I think the HTML5 spec does what is necessary here, but it may be that the
encodings specs are vague still.


[For the record, HTML5 currently requires delete and C1 characters (as  
well as C0 save white space) to be replaced by U+FFFD during  
`pre-processing of the input stream', which effectively circumvents the  
problem that character encoding specifications treat this range in a  
vague and inconsistent manner.]
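
[Illustration of the replacement step just described, as a minimal Python sketch operating on already-decoded text. The function name is hypothetical, and the set of C0 characters counted as white space (tab, line feed, form feed, carriage return) is an assumption of this example; the spec text should be consulted for the exact list.]

```python
# Assumed set of C0 "white space" characters that are left alone.
ALLOWED_C0 = {"\t", "\n", "\x0c", "\r"}


def replace_controls(text: str) -> str:
    """Replace DELETE, C1 controls and non-white-space C0 controls by U+FFFD."""
    out = []
    for ch in text:
        cp = ord(ch)
        is_c0 = cp < 0x20 and ch not in ALLOWED_C0
        is_delete_or_c1 = 0x7F <= cp <= 0x9F
        out.append("\ufffd" if is_c0 or is_delete_or_c1 else ch)
    return "".join(out)


print(repr(replace_controls("a\x80b\x7f\n\x01")))
```

Because the replacement happens after decoding, it does not matter whether a given encoding specification maps 0x80 to U+0080 or leaves it undefined.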


--
Øistein E. Andersen

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2009-04-11 Thread Øistein E . Andersen

On 22 May 2008, at 12:40, Ian Hickson wrote:


Do you have input on the EUC-JP issue?


I am now about to finish my analysis of CJK encodings (e-mail  
forthcoming), including EUC-JP.  This encoding does not seem to be  
particularly problematic, however.  Are you referring to a specific  
problem?



On Thu, 13 Mar 2008, Øistein E. Andersen wrote:
Note: Similarly, IE apparently handles CS-ISO-2022-JP as distinct from
ISO-2022-JP. This is something to keep in mind when looking at
multi-byte encodings.


What should we say about this?


The issue seems to be that IE's implementation of ISO-2022-JP is a  
large superset of what is actually specified.  (This is the case for  
several other CJK encodings as well.)  See forthcoming e-mail for an  
actual description of the extensions.



(TC)VN5712-2 ⊂ (TC)VN5712-1

Opera[?] and Firefox seem to have implemented the superset only.


Should we require this mapping?


For reference:
(TC)VN5712-3 ⊂ (TC)VN5712-2 = VSCII-2 = ISO IR 180 ⊂ (TC)VN5712-1

Only the complete set seems to be implemented (and only in Firefox),  
and MIME charset strings referring to one of the subsets do not seem  
to work at all, so no mappings are necessary.


--
Øistein E. Andersen

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2008-09-01 Thread Ian Hickson
On Wed, 30 Jul 2008, Øistein E. Andersen wrote:
 
 The current table seems to cover the mappings between different common 
 compatible 8-bit encodings as implemented in IE7, yes.  The table at 
 http://coq.no/character-tables/mime/en gives a bit more detail, most 
 of which is better kept outside HTML5 itself. However, the following 
 observations can be made:
 
 1.  Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
 IE7, on the other hand, simply ignores the high bit (as it does for
 a few other 7-bit encodings, by the way).  Perhaps this
 alias could be dropped from the other browsers.

Ignoring the high bit seems like a dangerous security bug; dropping any 
character with a high bit as U+FFFD seems unnecessarily drastic. I've made 
the spec go with the O/F/S behaviour here.


 2.  Firefox and Opera seem to sniff for text/plain; charset=ISO-8859-1
 (as per HTML5), whereas Safari seems to do the same for
 text/plain; charset=ISO-8859-11 instead [Version 3.1.2 (5525.20.1)].  Bug?

I believe so.


 3.  For certain character sets, different browsers map to different, but
 visually similar Unicode characters.  Sometimes, one mapping is
 old/outdated, but this is not always the case.

Not sure what I can do about that.


 4.  Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite
 inconsistently; different browsers do different things for the same
 encoding, and the same browser gives analogous encodings different
 treatment.
 
 (For the early ISO-8859-* encodings, the IANA registry points to RFC 1345,
 which effectively maps 0x7F--0x9F to U+7F--U+9F, but does not really
 seem to regard this feature as an essential part of the character set:
 
 the charset is often coded with both
 graphical and control character sets.  If the coded character set is
 a 96-character set, it is tabled with the relevant GL set (normally
 ISO-IR-6) and with ISO 6429 as C0 and C1
 
 As for the Windows-* encodings, Microsoft documentation treats bytes
 in this range as unassigned unless they are mapped to graphical
 characters, whereas Microsoft products return the underlying byte value
 in this case.)

I think the HTML5 spec does what is necessary here, but it may be that the 
encodings specs are vague still.


 5. IE handles KOI8-U as KOI8-RU, whereas Safari does the opposite. The former
 is probably more reasonable (assuming that letters are more important than
 line-drawing characters), but neither is actually correct given that the
 encodings are, strictly speaking, incompatible.  This issue will of course
 look a bit different if it can be shown that documents containing the
 letter Ў/ў (only in KOI8-RU) are frequently mislabelled as KOI8-U.

I guess we'll see what feedback we get on this when testing begins.

Cheers,

Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2008-07-29 Thread Øistein E . Andersen
On 22 May 2008, at 12:40, Ian Hickson wrote:

 would you say that what the spec says now is what browsers 
 implement? What should we change?

The current table seems to cover the mappings between different common
compatible 8-bit encodings as implemented in IE7, yes.  The table at
http://coq.no/character-tables/mime/en gives a bit more detail,
most of which is better kept outside HTML5 itself. However, the following
observations can be made:

1.  Opera, Firefox and Safari all handle US-ASCII as Windows-1252.
IE7, on the other hand, simply ignores the high bit (as it does for
a few other 7-bit encodings, by the way).  Perhaps this
alias could be dropped from the other browsers.

2.  Firefox and Opera seem to sniff for text/plain; charset=ISO-8859-1
(as per HTML5), whereas Safari seems to do the same for
text/plain; charset=ISO-8859-11 instead [Version 3.1.2 (5525.20.1)].  Bug?

3.  For certain character sets, different browsers map to different, but
visually similar Unicode characters.  Sometimes, one mapping is
old/outdated, but this is not always the case.

4.  Delete (0x7F) and the C1 range (0x80--0x9F) are handled quite
inconsistently; different browsers do different things for the same
encoding, and the same browser gives analogous encodings different
treatment.

(For the early ISO-8859-* encodings, the IANA registry points to RFC 1345,
which effectively maps 0x7F--0x9F to U+7F--U+9F, but does not really
seem to regard this feature as an essential part of the character set:

"the charset is often coded with both
graphical and control character sets.  If the coded character set is
a 96-character set, it is tabled with the relevant GL set (normally
ISO-IR-6) and with ISO 6429 as C0 and C1"

As for the Windows-* encodings, Microsoft documentation treats bytes
in this range as unassigned unless they are mapped to graphical characters,
whereas Microsoft products return the underlying byte value in this case.)

5. IE handles KOI8-U as KOI8-RU, whereas Safari does the opposite. The former
is probably more reasonable (assuming that letters are more important than
line-drawing characters), but neither is actually correct given that the
encodings are, strictly speaking, incompatible.  This issue will of course
look a bit different if it can be shown that documents containing the letter
Ў/ў (only in KOI8-RU) are frequently mislabelled as KOI8-U.
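
Observations 1 and 4 can be illustrated with a short Python sketch. It is
only an approximation under stated assumptions: the function names are
hypothetical, Python's "cp1252" and "latin-1" codecs stand in for what the
browsers do, and the fallback for undefined bytes imitates the "return the
underlying byte value" behaviour described above rather than any particular
product.

```python
def decode_masking_high_bit(data: bytes) -> str:
    """Ignore the high bit of every byte, then read the result as ASCII.

    Hypothetical model of the IE7 behaviour for US-ASCII described above.
    """
    return bytes(b & 0x7F for b in data).decode("ascii")


def decode_as_windows_1252(data: bytes) -> str:
    """Decode as Windows-1252, byte by byte.

    Bytes that Python's cp1252 codec leaves undefined fall back to the
    code point equal to the byte value (an assumption modelled on the
    behaviour attributed to Microsoft products above).
    """
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(chr(b))
    return "".join(out)


# A high-bit letter, two bytes from the C1 range, and one undefined byte.
sample = b"caf\xe9 \x93x\x94 \x81"

print(ascii(decode_masking_high_bit(sample)))   # high bit dropped
print(ascii(decode_as_windows_1252(sample)))    # graphical characters in C1
print(ascii(sample.decode("latin-1")))          # C1 bytes become C1 controls
```

The three lines of output differ exactly in the bytes above 0x7F, which is
where the browsers disagree.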

 Do you have input on the EUC-JP issue?

Not yet, but you can expect some input on CJK encodings at some point in
the future.

-- 
Øistein E. Andersen




Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2008-05-22 Thread Ian Hickson
On Thu, 13 Mar 2008, Øistein E. Andersen wrote:
 On 5th June 2007, Øistein E. Andersen wrote:
 
  (To do this properly, what we really ought to do is look for
  C1 and undefined characters in all IANA charsets and semi-official
  mappings to Unicode and check 1) whether the gaps can be filled
  by borrowing from other encodings, and 2) whether browsers
  actually do so. [...])
 
 I have finally got round to looking at superset encodings.
 
 To do this, I started with Unicode mappings from [UNI] for 8-bit 1-byte
 alphabet encodings and added mappings for other such encodings
 implemented in Opera, Safari or Firefox, mostly from [CSETS], though
 I made one for Windows-Sami-2 from a PDF.  (I then discovered that IE
 had something called Arabic-ASMO, for which no matching specification 
 could be found, and subsequently reverse-engineered all IE's encodings.
 Most of these turned out to be identical to other mappings or only
 add characters from the PUA, but some real differences were found,
 and those are reported in the text below.)
 
 [UNI] http://unicode.org/Public/MAPPINGS/
 [CSETS] http://crl.nmsu.edu/~mleisher/csets.html
 
 All the character repertoires and encoding vectors defined by the mappings
 were then compared pairwise. (Codepoints mapped to C0, space, BS or C1
 were treated as unassigned, and directionality indicators for Arabic and
 Hebrew were ignored.) The result is quite a big and unreadable table
 [FULL], so the repertoires and encodings were clustered, which gave rise to
 the tables in [ENC], which compare charsets with fewer than 27 incompatible
 codepoints, as well as those in [REP], which compare charsets with at most
 60 characters not found in both repertoires. (The thresholds are arbitrary,
 but more than sufficiently large to ensure that all related charsets will be
 clustered together and at the same time sufficiently small to keep the
 tables at a reasonable size.)
 
 [FULL] http://coq.no/X/charset-table.html
 [ENC] http://coq.no/X/charset-enc.html
 [REP] http://coq.no/X/charset-rep.html
 
 A short summary of the most interesting/relevant results (supported by [ENC])
 can be found below.
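
The pairwise comparison quoted above can be sketched roughly as follows.
This is not the script that produced [FULL], [ENC] or [REP]; it is a minimal
illustration that uses Python's built-in single-byte codecs as stand-ins for
the mapping files, and the function names are hypothetical. Treating DEL
(0x7F) as unassigned is an added assumption, and directionality indicators
are not handled.

```python
from typing import List, Optional


def encoding_vector(codec: str) -> List[Optional[str]]:
    """Return the 256-entry byte-to-character vector of a single-byte codec.

    Undefined bytes, and bytes mapped to C0, space, DEL or C1, are
    recorded as None, i.e. treated as unassigned.
    """
    vector: List[Optional[str]] = []
    for byte in range(256):
        try:
            ch: Optional[str] = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            ch = None
        if ch is not None and (ord(ch) <= 0x20 or 0x7F <= ord(ch) <= 0x9F):
            ch = None
        vector.append(ch)
    return vector


def incompatible_codepoints(a: str, b: str) -> int:
    """Count bytes assigned in both encodings but to different characters."""
    pairs = zip(encoding_vector(a), encoding_vector(b))
    return sum(1 for x, y in pairs
               if x is not None and y is not None and x != y)


def repertoire_difference(a: str, b: str) -> int:
    """Count characters found in one repertoire but not in both."""
    rep_a = {c for c in encoding_vector(a) if c is not None}
    rep_b = {c for c in encoding_vector(b) if c is not None}
    return len(rep_a ^ rep_b)


for other in ("cp1252", "iso8859_15"):
    print(other,
          incompatible_codepoints("latin-1", other),
          repertoire_difference("latin-1", other))
```

Under these definitions one encoding is a superset of another when the
incompatible count is zero, which is the property the clustering thresholds
are meant to expose.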

This is quite amazing data, thank you.

I'm not sure what to do with it, frankly. Given your familiarity with the 
topic, would you say that what the spec says now is what browsers 
implement? What should we change?

Do you have input on the EUC-JP issue?


 PS: How should colour be added to tables like these in HTML5 with
 neither of the attributes bgcolor and style?

Class attribute and external stylesheets. (Possibly a data-* attribute.)



 Note: Similarly, IE apparently handles CS-ISO-2022-JP as distinct from
   ISO-2022-JP. This is something to keep in mind when looking at
   multi-byte encodings.

What should we say about this?


 (TC)VN5712-2  (TC)VN5712-1
 
 Opera and Firefox seem to have implemented the superset only.

Should we require this mapping?

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2008-03-16 Thread Øistein E . Andersen
Krzysztof Żelechowski wrote:

 Some characters, like digits, are direction-transparent [...]
 Inserting an LTR mark before them makes them LTR.

Thanks.  I would have preferred a solution which did not involve inserting
extraneous characters, but I have now added LTR marks to fix the rendering.

-- 
Øistein E. Andersen



Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

2008-03-13 Thread Krzysztof Żelechowski

On Thu, 13-03-2008, at 02:04 +0100, Øistein E. Andersen wrote:
 PPS: Some right-to-left characters contaminate surrounding characters as I
  have not yet found a simple solution to make everything strictly
  left-to-right (probably because I have not looked for it properly).

Some characters, like digits, are direction-transparent;
they inherit direction from the preceding text.
Inserting an LTR mark before them makes them LTR.
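
As a rough sketch of that suggestion, the following Python snippet puts
U+200E LEFT-TO-RIGHT MARK in front of every run of ASCII digits. The helper
name is hypothetical, and whether this is enough to fix a given table still
depends on how the renderer applies the bidirectional algorithm.

```python
import re

LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK


def mark_digits_ltr(text: str) -> str:
    """Insert an LTR mark before each run of ASCII digits."""
    return re.sub(r"[0-9]+", lambda m: LRM + m.group(0), text)


# A Hebrew letter followed by a number, as might occur in a table cell.
cell = "\u05d0 123"
print(ascii(mark_digits_ltr(cell)))
```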

Chris