Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-08-03 Thread NARUSE, Yui
2013/8/1 Ian Hickson i...@hixie.ch:
 On Thu, 1 Aug 2013, Martin Janecke wrote:

 I don't see any sense in making a document that is declared as
 ISO-8859-1 and encoded as ISO-8859-1 non-conforming. Just because the
 ISO-8859-1 code points are a subset of windows-1252? So is US-ASCII.
 Should a US-ASCII declaration also be non-conforming then -- even if
 the document only contains bytes from the US-ASCII range? What's the
 benefit?

 I assume this is supposed to be helpful in some way, but to me it just
 seems wrong and confusing.

 If you avoid the bytes that are different in ISO-8859-1 and Win1252, the
 spec now allows you to use either label. (As well as "cp1252", "cp819",
 "ibm819", "l1", "latin1", "x-cp1252", etc.)

 The part that I find problematic is that if you use byte 0x85 from
 Windows-1252 (U+2026 … HORIZONTAL ELLIPSIS), and then label the document
 as "ansi_x3.4-1968", "ascii", "iso-8859-1", "iso-ir-100", "iso8859-1",
 "iso_8859-1:1987", "us-ascii", or a number of other options, it'll still
 be valid, and it'll work exactly as if you'd labeled it "windows-1252".
 This despite the fact that in ASCII and in ISO-8859-1, byte 0x85 does not
 map to U+2026. It maps to U+0085 in 8859-1, and it is undefined in ASCII
 (since ASCII is a 7-bit encoding).

The ISO-8859-1 vs. Windows-1252 issue sounds like a minor one, because 0x85 is NEXT LINE (NEL).
As far as I know, 0x85/U+0085 is used only on some IBM systems.

For Japanese encodings, there's a Shift_JIS vs. Windows-31J issue, which
has long annoyed people.
Windows-31J has many characters that aren't included in Shift_JIS,
and many Unicode mappings that differ from Shift_JIS's.
But many existing Web pages declare Shift_JIS and use characters
that exist only in Windows-31J.
Therefore, if people want to declare a document as truly Shift_JIS,
there's no way to do so in the existing framework.
It would need a new mechanism, for example a new meta specifier like META
i-want-to-truly-specify-charset-as=Shift_JIS,
with browsers recognizing the document's encoding as true Shift_JIS.
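
As a rough illustration (a sketch assuming Python's standard codecs,
where shift_jis is the strict JIS X 0208 mapping and cp932 is
Windows-31J), the two disagree on both repertoire and mappings:

    # U+2460 CIRCLED DIGIT ONE exists in Windows-31J but not in
    # strict Shift_JIS:
    "\u2460".encode("cp932")      # b'\x87@' (0x87 0x40)
    "\u2460".encode("shift_jis")  # raises UnicodeEncodeError

    # The same bytes can also decode to different characters:
    b"\x81\x60".decode("shift_jis")  # '\u301c' WAVE DASH
    b"\x81\x60".decode("cp932")      # '\uff5e' FULLWIDTH TILDE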

But such people should use UTF-8 instead of introducing such a new mechanism.

-- 
NARUSE, Yui  nar...@airemix.jp


Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-07-31 Thread Ian Hickson
On Mon, 1 Jul 2013, Glenn Maynard wrote:
 On Mon, Jul 1, 2013 at 6:16 PM, Ian Hickson i...@hixie.ch wrote:
 
  It seems bad, and maybe rather full of hubris, to make it conforming 
  to use a label that we know will be interpreted in a manner that is a 
  willful violation of its spec (that is, the ISO spec).
 
 It's hard enough to get people to label their encodings in the first
 place.  It doesn't seem like a good idea to spend people's limited
 attention on encodings with "you should change your encoding label, even
 though what you already have will always work," especially given how
 widespread the ISO-8859-1 label is.

Fair enough.


 (FWIW, I wouldn't change a server to say "windows-1252".  The ISO spec is
 so far out of touch with reality that it's hard to consider it
 authoritative; in reality, ISO-8859-1 is 1252.)

It certainly seems that that is how most software interprets it.


On Tue, 2 Jul 2013, Jukka K. Korpela wrote:

 2013-07-02 2:16, Ian Hickson wrote:
  
  The reason that ISO-8859-1 is currently non-conforming is that the 
  label no longer means ISO-8859-1, as defined by the ISO. It actually 
  means Windows-1252.
 
 Declaring ISO-8859-1 has no problems when the document does not contain 
 bytes in the range 0x80...0x9F, as it should not. There is a huge number 
 of existing pages to which this applies, and they are valid by HTML 4.01 
 (or, as the case may be, XHTML 1.0) rules. Declaring all of them as 
 non-conforming and issuing an error message about them does not seem to 
 be useful.

Right. I note that you omitted to quote the following from my original
e-mail: "Previously, this was also somewhat the case, but it was only an
error to use ISO-8859-1 in a manner that was not equivalent across both
encodings (there was the concept of 'misinterpreted for compatibility').
This was removed with the move to the Encoding spec."

This kind of error handling is what I would personally prefer.


 You might say that such pages are risky and the risk should be announced,
 because if the page is later changed so that it contains a byte in that range, it
 will not be interpreted by ISO-8859-1 but by windows-1252.

Honestly, merely not using UTF-8 is far more risky than the difference
between 8859-1 and 1252. The encoding of the page is also the encoding 
used in a bunch of outgoing (encoding) features, and users aren't going to 
conveniently limit themselves to the character set of the encoding of the 
page when e.g. submitting forms.
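
For instance (a minimal sketch, assuming Python; the xmlcharrefreplace
error handler approximates how browsers fall back to numeric character
references for characters the page encoding cannot express):

    def form_encode(value, page_encoding):
        # Characters outside the page encoding are silently turned
        # into "&#NNNN;" references, mangling the submitted value.
        return value.encode(page_encoding, errors="xmlcharrefreplace")

    form_encode("→", "cp1252")  # b'&#8594;'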


 I think the simplest approach would be to declare U+0080...U+009F as 
 forbidden in both serializations.

I don't see any point in making them non-conforming in actual Win1252 
content. That's not harmful.


  It seems bad, and maybe rather full of hubris, to make it conforming 
  to use a label that we know will be interpreted in a manner that is a 
  willful violation of its spec (that is, the ISO spec).
 
 In most cases, there is no violation of the ISO standard. Or, to put it 
 in another way, taking ISO-8859-1 as a synonym for windows-1252 is fully 
 compatible with the ISO 8859-1 standard as long as the document does not 
 contain data that would be interpreted by ISO 8859-1 as C1 Controls 
 (U+0080...U+009F), which it should not contain.

It's still a violation.

I'm not saying we shouldn't violate it; it's clearly the right thing to 
do. But despite having many willful violations of other standards in the 
HTML standard, I wouldn't want us to ever get to a stage where we were 
casual in our violations, or where we minimised or dismissed the issue.


  I would rather go back to having the conflicts be caught by validators 
  than just throw the ISO spec under the bus, but it's really up to you 
  (Henri, and whoever else is implementing a validator).
 
 Consider a typical case. Joe Q. Author is using ISO-8859-1 as he has 
 done for years, and remains happy, until he tries to validate his page 
 as HTML5. Is it useful that he gets an error message (and gets 
 confused), even though his data is all ISO-8859-1 (without C1 Controls)? 

No, it's not. Like I said, I would rather go back to having the conflicts 
be caught by validators.


 Suppose then that he accidentally enters, say, the euro sign “€” because
 his text editor or other authoring tool lets him do so – and stores it as
 windows-1252 encoded. Even then, no practical problem arises, due to the
 common error handling behavior, but at this point, it might be useful to
 give some diagnostic if the document is being validated.

Right.

Unfortunately it seems you and I are alone in thinking this.


 I would say that even then a warning about the problem would be sufficient,
 but it could be treated as an error

There's not really a difference, in a validator.


In any case, I've changed the spec to allow any label to be used for an 
encoding.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-07-31 Thread Ian Hickson
On Thu, 1 Aug 2013, Martin Janecke wrote:
 
 I don't see any sense in making a document that is declared as 
 ISO-8859-1 and encoded as ISO-8859-1 non-conforming. Just because the 
 ISO-8859-1 code points are a subset of windows-1252? So is US-ASCII. 
 Should a US-ASCII declaration also be non-conforming then -- even if
 the document only contains bytes from the US-ASCII range? What's the 
 benefit?
 
 I assume this is supposed to be helpful in some way, but to me it just 
 seems wrong and confusing.

If you avoid the bytes that are different in ISO-8859-1 and Win1252, the
spec now allows you to use either label. (As well as "cp1252", "cp819",
"ibm819", "l1", "latin1", "x-cp1252", etc.)

The part that I find problematic is that if you use byte 0x85 from
Windows-1252 (U+2026 … HORIZONTAL ELLIPSIS), and then label the document
as "ansi_x3.4-1968", "ascii", "iso-8859-1", "iso-ir-100", "iso8859-1",
"iso_8859-1:1987", "us-ascii", or a number of other options, it'll still
be valid, and it'll work exactly as if you'd labeled it "windows-1252".
This despite the fact that in ASCII and in ISO-8859-1, byte 0x85 does not
map to U+2026. It maps to U+0085 in 8859-1, and it is undefined in ASCII
(since ASCII is a 7-bit encoding).
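
As a quick illustration (a sketch assuming Python's codecs, with
latin-1 as strict ISO-8859-1 and cp1252 as Windows-1252):

    b"\x85".decode("cp1252")   # '…' (U+2026 HORIZONTAL ELLIPSIS)
    b"\x85".decode("latin-1")  # '\x85' (U+0085 NEXT LINE, a C1 control)
    b"\x85".decode("ascii")    # raises UnicodeDecodeError

Browsers following the Encoding Standard apply the first interpretation
to every one of the labels above.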

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-07-02 Thread Jukka K. Korpela

2013-07-02 2:16, Ian Hickson wrote:


The reason that ISO-8859-1 is currently non-conforming is that the label
no longer means ISO-8859-1, as defined by the ISO. It actually means
Windows-1252.


Declaring ISO-8859-1 has no problems when the document does not contain 
bytes in the range 0x80...0x9F, as it should not. There is a huge number 
of existing pages to which this applies, and they are valid by HTML 4.01 
(or, as the case may be, XHTML 1.0) rules. Declaring all of them as 
non-conforming and issuing an error message about them does not seem to 
be useful.


You might say that such pages are risky and the risk should be 
announced, because if the page is later changed so that it contains a byte
in that range, it will not be interpreted by ISO-8859-1 but by 
windows-1252. From the perspective of tradition and practice, this is 
just about error handling. By HTML 4.01, those bytes should be 
interpreted as control characters according to ISO-8859-1, and this 
would make the document invalid, since those control characters are 
disallowed in HTML 4.01. Thus, whatever browsers do with the document 
then is error processing, and nowadays probably all browsers have chosen 
to interpret them by windows-1252.


Admittedly, in XHTML syntax it’s different since those control 
characters are not forbidden but (mostly) “just” discouraged.


I think the simplest approach would be to declare U+0080...U+009F as 
forbidden in both serializations. Then the issue could be defined purely 
in terms of error handling. If you declare ISO-8859-1 and do not have 
bytes 0x80...0x9F, fine. If you do have such a byte, we should still 
treat the encoding declaration as conforming as such, but validators 
should report the characters as errors and browsers should handle this 
error by interpreting the document as if the declared encoding were 
windows-1252.
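
A minimal sketch of that error handling (a hypothetical helper,
assuming Python; cp1252 is Windows-1252):

    def check_latin1_document(data):
        # Report each C1-range byte as a data error, as a validator
        # would, but still decode the way browsers do.
        for i, b in enumerate(data):
            if 0x80 <= b <= 0x9F:
                print("error: byte 0x%02X at offset %d" % (b, i))
        return data.decode("cp1252")

    check_latin1_document("€".encode("cp1252"))  # flags the 0x80 byte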



It seems bad, and maybe rather full of hubris, to make it conforming to
use a label that we know will be interpreted in a manner that is a willful
violation of its spec (that is, the ISO spec).


In most cases, there is no violation of the ISO standard. Or, to put it 
in another way, taking ISO-8859-1 as a synonym for windows-1252 is fully 
compatible with the ISO 8859-1 standard as long as the document does not 
contain data that would be interpreted by ISO 8859-1 as C1 Controls 
(U+0080...U+009F), which it should not contain.



I would rather go back to having the conflicts be caught by validators
than just throw the ISO spec under the bus, but it's really up to you
(Henri, and whoever else is implementing a validator).


Consider a typical case. Joe Q. Author is using ISO-8859-1 as he has 
done for years, and remains happy, until he tries to validate his page 
as HTML5. Is it useful that he gets an error message (and gets 
confused), even though his data is all ISO-8859-1 (without C1 Controls)? 
Suppose then that he accidentally enters, say, the euro sign “€” because
his text editor or other authoring tool lets him do so – and stores it as
windows-1252 encoded. Even then, no practical problem arises, due to the 
common error handling behavior, but at this point, it might be useful to 
give some diagnostic if the document is being validated.


I would say that even then a warning about the problem would be 
sufficient, but it could be treated as an error – as a data error, with 
defined error handling. The occurrences of the offending bytes should be 
reported (which is what now happens when validating as HTML 4.01, even 
though the error messages are cryptic, like “non SGML character number 
128”). The author might then decide to declare the encoding as windows-1252.


But even though the most common cause of such a situation is an attempt 
to use (mostly due to ignorance) certain characters without realizing 
that they do not exist in ISO-8859-1, it might be a symptom of some 
different problem, like malformed data unintentionally appearing in a 
document. It is thus useful to draw the author’s attention to specific 
problems, incorrect data where it appears, rather than blindly taking 
ISO-8859-1 as windows-1252.


Yucca




Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-07-02 Thread Anne van Kesteren
On Tue, Jul 2, 2013 at 8:05 AM, Jukka K. Korpela jkorp...@cs.tut.fi wrote:
 [...]

I think a much more interesting problem is when they update that old
page with an IRI, form, or some XMLHttpRequest, and shit hits the
fan. That's why you want to flag all non-utf-8 usage and just get
people to migrate towards sanity.


--
http://annevankesteren.nl/


Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-07-02 Thread Jukka K. Korpela

2013-07-02 10:15, Anne van Kesteren wrote:


On Tue, Jul 2, 2013 at 8:05 AM, Jukka K. Korpela jkorp...@cs.tut.fi wrote:

[...]


I think a much more interesting problem is when they update that old
page with an IRI, form, or some XMLHttpRequest, and shit hits the
fan. That's why you want to flag all non-utf-8 usage and just get
people to migrate towards sanity.


Such evangelism is a different issue. If you want to nag “you should use 
UTF-8”, as a warning, each and every time someone declares any
other encoding, you will confuse or irritate many people and will reduce 
the popularity of validators. But in any case, it is quite distinct from 
the issue of declaring the iso-8859-1 encoding as an error.


Yucca



Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-07-02 Thread Anne van Kesteren
On Tue, Jul 2, 2013 at 9:24 AM, Jukka K. Korpela jkorp...@cs.tut.fi wrote:
 Such evangelism is a different issue. If you want to nag “you should use
 UTF-8”, as a warning, each and every time someone declares any other
 encoding, you will confuse or irritate many people and will reduce the
 popularity of validators. But in any case, it is quite distinct from the
 issue of declaring the iso-8859-1 encoding as an error.

Given that the validator is used primarily for new content, I doubt
that very much. And most new content is utf-8 already. As for how it's
related, http://encoding.spec.whatwg.org/ takes the stance currently
that all non-utf-8 encodings are simply non-conforming, as there's too
much room for error when using them.


--
http://annevankesteren.nl/


Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-07-01 Thread Ian Hickson
On Tue, 26 Mar 2013, Henri Sivonen wrote:

 In various places that deal with encoding labels, the HTML spec now 
 requires authors to use the name of the encoding from the Encoding 
 Standard, which means using the preferred name rather than an alias.
 
 Compared to the previous reference to the IANA registry, some names that 
 work in all browsers but are no longer preferred names are now errors, 
 such as iso-8859-1 and tis-620. Making broadly-supported names that were 
 previously preferred names according to IANA now be errors does not 
 appear to provide any utility to Web authors who use validators.
 
 Please relax the requirement so that at least previously-preferred names 
 are not errors.
 
 zcorpan suggested
 (http://krijnhoetmer.nl/irc-logs/whatwg/20130325#l-920) allowing
 non-preferred names for non-UTF-8 encodings. I'm not familiar with the
 level of browser support for all of the non-preferred aliases, but I
 could accept zcorpan's suggestion.

The reason that ISO-8859-1 is currently non-conforming is that the label 
no longer means ISO-8859-1, as defined by the ISO. It actually means 
Windows-1252. Previously, this was also somewhat the case, but it was
only an error to use ISO-8859-1 in a manner that was not equivalent across
both encodings (there was the concept of "misinterpreted for
compatibility"). This was removed with the move to the Encoding spec.

It seems bad, and maybe rather full of hubris, to make it conforming to 
use a label that we know will be interpreted in a manner that is a willful 
violation of its spec (that is, the ISO spec).

I would rather go back to having the conflicts be caught by validators 
than just throw the ISO spec under the bus, but it's really up to you 
(Henri, and whoever else is implementing a validator).

Given the above context, do you still think we should make ISO-8859-1 
unconditionally valid?

If it is, I'll change the various places in the spec that refer to 
encoding names to also allow any of the encoding labels.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-07-01 Thread Glenn Maynard
On Mon, Jul 1, 2013 at 6:16 PM, Ian Hickson i...@hixie.ch wrote:

 It seems bad, and maybe rather full of hubris, to make it conforming to
 use a label that we know will be interpreted in a manner that is a willful
 violation of its spec (that is, the ISO spec).


It's hard enough to get people to label their encodings in the first
place.  It doesn't seem like a good idea to spend people's limited
attention on encodings with "you should change your encoding label, even
though what you already have will always work," especially given how
widespread the ISO-8859-1 label is.

(FWIW, I wouldn't change a server to say "windows-1252".  The ISO spec is so
far out of touch with reality that it's hard to consider it authoritative;
in reality, ISO-8859-1 is 1252.)

-- 
Glenn Maynard


[whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason

2013-03-26 Thread Henri Sivonen
In various places that deal with encoding labels, the HTML spec now
requires authors to use the name of the encoding from the Encoding
Standard, which means using the preferred name rather than an alias.

Compared to the previous reference to the IANA registry, some names
that work in all browsers but are no longer preferred names are now
errors, such as iso-8859-1 and tis-620. Making broadly-supported names
that were previously preferred names according to IANA now be errors
does not appear to provide any utility to Web authors who use
validators.
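
As a rough sketch (a partial, hand-copied excerpt of the label table in
http://encoding.spec.whatwg.org/, expressed as a Python mapping):

    LABELS = {
        "windows-1252": "windows-1252",  # the preferred name
        "iso-8859-1":   "windows-1252",  # previously preferred by IANA
        "latin1":       "windows-1252",
        "us-ascii":     "windows-1252",
        "windows-874":  "windows-874",   # the preferred name
        "tis-620":      "windows-874",   # previously preferred by IANA
    }

Browsers resolve all of these labels to the same decoders, so which
label a page uses makes no difference to how it is decoded.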

Please relax the requirement so that at least previously-preferred
names are not errors.

zcorpan suggested
(http://krijnhoetmer.nl/irc-logs/whatwg/20130325#l-920) allowing
non-preferred names for non-UTF-8 encodings. I'm not familiar with the
level of browser support for all of the non-preferred aliases, but I
could accept zcorpan's suggestion.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/