Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
2013/8/1 Ian Hickson i...@hixie.ch:

On Thu, 1 Aug 2013, Martin Janecke wrote: I don't see any sense in making a document that is declared as ISO-8859-1 and encoded as ISO-8859-1 non-conforming. Just because the ISO-8859-1 code points are a subset of windows-1252? So is US-ASCII. Should a US-ASCII declaration also be non-conforming then -- even if the document only contains bytes from the US-ASCII range? What's the benefit? I assume this is supposed to be helpful in some way, but to me it just seems wrong and confusing.

If you avoid the bytes that are different in ISO-8859-1 and Win1252, the spec now allows you to use either label. (As well as cp1252, cp819, ibm819, l1, latin1, x-cp1252, etc.) The part that I find problematic is that if you use byte 0x85 from Windows-1252 (U+2026 … HORIZONTAL ELLIPSIS), and then label the document as ansi_x3.4-1968, ascii, iso-8859-1, iso-ir-100, iso8859-1, iso_8859-1:1987, us-ascii, or a number of other options, it'll still be valid, and it'll work exactly as if you'd labeled it windows-1252. This despite the fact that in ASCII and in ISO-8859-1, byte 0x85 does not map to U+2026. It maps to U+0085 in 8859-1, and it is undefined in ASCII (since ASCII is a 7-bit encoding).

The ISO-8859-1 vs. Windows-1252 issue seems minor, because 0x85 is NEXT LINE; as far as I know, 0x85/U+0085 is used only on some IBM systems. For Japanese encodings there is the Shift_JIS vs. Windows-31J issue, which has long annoyed people. Windows-31J adds many characters that are not included in Shift_JIS, and maps many characters to different Unicode code points than Shift_JIS does. Yet many existing Web pages declare Shift_JIS while using characters that exist only in Windows-31J. So if people want to declare a document as truly Shift_JIS, there is no way to do so in the existing framework. It would need a new mechanism, for example a new meta specifier like META i-want-to-truly-specify-charset-as=Shift_JIS, so that the browser recognizes the document's encoding as true Shift_JIS.
But such people should use UTF-8 instead of introducing such a new mechanism. -- NARUSE, Yui nar...@airemix.jp
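The Shift_JIS vs. Windows-31J gap NARUSE describes can be seen directly with Python's codecs, where "shift_jis" is the strict JIS X 0208-based table and "cp932" is Windows-31J (a sketch for illustration, not part of the thread):

```python
# The same two bytes decode to different code points under the two tables:
wave = b"\x81\x60"
assert wave.decode("shift_jis") == "\u301c"  # WAVE DASH
assert wave.decode("cp932") == "\uff5e"      # FULLWIDTH TILDE

# NEC extension characters exist only in Windows-31J:
nec = b"\x87\x40"
assert nec.decode("cp932") == "\u2460"       # CIRCLED DIGIT ONE
try:
    nec.decode("shift_jis")                  # undefined in strict Shift_JIS
except UnicodeDecodeError:
    print("0x8740 is not a Shift_JIS character")
```

This is why a page that declares Shift_JIS but contains Windows-31J-only characters only works because browsers quietly treat the Shift_JIS label as Windows-31J.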
Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
On Mon, 1 Jul 2013, Glenn Maynard wrote:

On Mon, Jul 1, 2013 at 6:16 PM, Ian Hickson i...@hixie.ch wrote: It seems bad, and maybe rather full of hubris, to make it conforming to use a label that we know will be interpreted in a manner that is a willful violation of its spec (that is, the ISO spec).

It's hard enough to get people to label their encodings in the first place. It doesn't seem like a good idea to spend people's limited attention on encodings with "you should change your encoding label", even though what you already have will always work, especially given how widespread the ISO-8859-1 label is.

Fair enough.

(FWIW, I wouldn't change a server to say windows-1252. The ISO spec is so far out of touch with reality that it's hard to consider it authoritative; in reality, ISO-8859-1 is 1252.)

It certainly seems that that is how most software interprets it.

On Tue, 2 Jul 2013, Jukka K. Korpela wrote:

2013-07-02 2:16, Ian Hickson wrote: The reason that ISO-8859-1 is currently non-conforming is that the label no longer means ISO-8859-1, as defined by the ISO. It actually means Windows-1252.

Declaring ISO-8859-1 has no problems when the document does not contain bytes in the range 0x80...0x9F, as it should not. There is a huge number of existing pages to which this applies, and they are valid by HTML 4.01 (or, as the case may be, XHTML 1.0) rules. Declaring all of them as non-conforming and issuing an error message about them does not seem to be useful.

Right. I note that you omitted to quote the following from my original e-mail: "Previously, this was also somewhat the case, but it was only an error to use ISO-8859-1 in a manner that was not equivalent across both encodings (there was the concept of 'misinterpreted for compatibility'). This was removed with the move to the Encoding spec." This kind of error handling is what I would personally prefer.
You might say that such pages are risky and the risk should be announced, because if the page is later changed so that it contains a byte in that range, it will not be interpreted by ISO-8859-1 but by windows-1252.

Honestly, merely not using UTF-8 is far more risky than the difference between 8859-1 and 1252. The encoding of the page is also the encoding used in a bunch of outgoing (encoding) features, and users aren't going to conveniently limit themselves to the character set of the encoding of the page when e.g. submitting forms.

I think the simplest approach would be to declare U+0080...U+009F as forbidden in both serializations.

I don't see any point in making them non-conforming in actual Win1252 content. That's not harmful.

It seems bad, and maybe rather full of hubris, to make it conforming to use a label that we know will be interpreted in a manner that is a willful violation of its spec (that is, the ISO spec).

In most cases, there is no violation of the ISO standard. Or, to put it in another way, taking ISO-8859-1 as a synonym for windows-1252 is fully compatible with the ISO 8859-1 standard as long as the document does not contain data that would be interpreted by ISO 8859-1 as C1 Controls (U+0080...U+009F), which it should not contain.

It's still a violation. I'm not saying we shouldn't violate it; it's clearly the right thing to do. But despite having many willful violations of other standards in the HTML standard, I wouldn't want us to ever get to a stage where we were casual in our violations, or where we minimised or dismissed the issue. I would rather go back to having the conflicts be caught by validators than just throw the ISO spec under the bus, but it's really up to you (Henri, and whoever else is implementing a validator).

Consider a typical case. Joe Q. Author is using ISO-8859-1 as he has done for years, and remains happy, until he tries to validate his page as HTML5.
Is it useful that he gets an error message (and gets confused), even though his data is all ISO-8859-1 (without C1 Controls)?

No, it's not. Like I said, I would rather go back to having the conflicts be caught by validators.

Suppose then that he accidentally enters, say, the euro sign “€” because his text editor or other authoring tool lets him do so – and stores it as windows-1252 encoded. Even then, no practical problem arises, due to the common error handling behavior, but at this point, it might be useful to give some diagnostic if the document is being validated.

Right. Unfortunately it seems you and I are alone in thinking this.

I would say that even then a warning about the problem would be sufficient, but it could be treated as an error.

There's not really a difference, in a validator. In any case, I've changed the spec to allow any label to be used for an encoding.

-- Ian Hickson http://ln.hixie.ch/
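Jukka's proposal in the exchange above — forbid U+0080...U+009F in both serializations — amounts to a simple check over the decoded text. A minimal sketch (the helper name is invented here, not any real validator's API):

```python
def has_forbidden_c1(text: str) -> bool:
    """True if the decoded text contains U+0080..U+009F, the range the
    proposal would forbid in both the HTML and XHTML serializations."""
    return any("\u0080" <= ch <= "\u009f" for ch in text)

# Ordinary Latin-1 text passes; a stray C1 control such as NEL fails.
assert not has_forbidden_c1("caf\u00e9")
assert has_forbidden_c1("line one\u0085line two")
```

Under this rule, a document labeled ISO-8859-1 with no bytes in 0x80...0x9F is fine under either reading of the label, and the interesting cases are exactly the ones this check flags.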
Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
On Thu, 1 Aug 2013, Martin Janecke wrote: I don't see any sense in making a document that is declared as ISO-8859-1 and encoded as ISO-8859-1 non-conforming. Just because the ISO-8859-1 code points are a subset of windows-1252? So is US-ASCII. Should a US-ASCII declaration also be non-conforming then -- even if the document only contains bytes from the US-ASCII range? What's the benefit? I assume this is supposed to be helpful in some way, but to me it just seems wrong and confusing.

If you avoid the bytes that are different in ISO-8859-1 and Win1252, the spec now allows you to use either label. (As well as cp1252, cp819, ibm819, l1, latin1, x-cp1252, etc.) The part that I find problematic is that if you use byte 0x85 from Windows-1252 (U+2026 … HORIZONTAL ELLIPSIS), and then label the document as ansi_x3.4-1968, ascii, iso-8859-1, iso-ir-100, iso8859-1, iso_8859-1:1987, us-ascii, or a number of other options, it'll still be valid, and it'll work exactly as if you'd labeled it windows-1252. This despite the fact that in ASCII and in ISO-8859-1, byte 0x85 does not map to U+2026. It maps to U+0085 in 8859-1, and it is undefined in ASCII (since ASCII is a 7-bit encoding).

-- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
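The three readings of byte 0x85 that Ian describes can be checked directly with Python's codecs, using "cp1252" for windows-1252 and "latin-1" for ISO-8859-1 as the ISO defines it (a sketch for illustration):

```python
raw = b"\x85"

# windows-1252 maps 0x85 to U+2026 (HORIZONTAL ELLIPSIS)
assert raw.decode("cp1252") == "\u2026"

# ISO-8859-1 proper maps 0x85 to the C1 control U+0085 (NEXT LINE)
assert raw.decode("latin-1") == "\u0085"

# US-ASCII is 7-bit, so 0x85 is not a valid byte at all
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    print("0x85 is undefined in ASCII")
```

Browsers, following the Encoding Standard, apply the first mapping regardless of which of those labels the document carries.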
Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
2013-07-02 2:16, Ian Hickson wrote: The reason that ISO-8859-1 is currently non-conforming is that the label no longer means ISO-8859-1, as defined by the ISO. It actually means Windows-1252.

Declaring ISO-8859-1 has no problems when the document does not contain bytes in the range 0x80...0x9F, as it should not. There is a huge number of existing pages to which this applies, and they are valid by HTML 4.01 (or, as the case may be, XHTML 1.0) rules. Declaring all of them as non-conforming and issuing an error message about them does not seem to be useful.

You might say that such pages are risky and the risk should be announced, because if the page is later changed so that it contains a byte in that range, it will not be interpreted by ISO-8859-1 but by windows-1252. From the perspective of tradition and practice, this is just about error handling. By HTML 4.01, those bytes should be interpreted as control characters according to ISO-8859-1, and this would make the document invalid, since those control characters are disallowed in HTML 4.01. Thus, whatever browsers do with the document then is error processing, and nowadays probably all browsers have chosen to interpret them by windows-1252. Admittedly, in XHTML syntax it’s different, since those control characters are not forbidden but (mostly) “just” discouraged.

I think the simplest approach would be to declare U+0080...U+009F as forbidden in both serializations. Then the issue could be defined purely in terms of error handling. If you declare ISO-8859-1 and do not have bytes 0x80...0x9F, fine. If you do have such a byte, we should still treat the encoding declaration as conforming as such, but validators should report the characters as errors and browsers should handle this error by interpreting the document as if the declared encoding were windows-1252.
It seems bad, and maybe rather full of hubris, to make it conforming to use a label that we know will be interpreted in a manner that is a willful violation of its spec (that is, the ISO spec).

In most cases, there is no violation of the ISO standard. Or, to put it in another way, taking ISO-8859-1 as a synonym for windows-1252 is fully compatible with the ISO 8859-1 standard as long as the document does not contain data that would be interpreted by ISO 8859-1 as C1 Controls (U+0080...U+009F), which it should not contain.

I would rather go back to having the conflicts be caught by validators than just throw the ISO spec under the bus, but it's really up to you (Henri, and whoever else is implementing a validator).

Consider a typical case. Joe Q. Author is using ISO-8859-1 as he has done for years, and remains happy, until he tries to validate his page as HTML5. Is it useful that he gets an error message (and gets confused), even though his data is all ISO-8859-1 (without C1 Controls)?

Suppose then that he accidentally enters, say, the euro sign “€” because his text editor or other authoring tool lets him do so – and stores it as windows-1252 encoded. Even then, no practical problem arises, due to the common error handling behavior, but at this point, it might be useful to give some diagnostic if the document is being validated. I would say that even then a warning about the problem would be sufficient, but it could be treated as an error – as a data error, with defined error handling. The occurrences of the offending bytes should be reported (which is what now happens when validating as HTML 4.01, even though the error messages are cryptic, like “non SGML character number 128”). The author might then decide to declare the encoding as windows-1252.
But even though the most common cause of such a situation is an attempt to use (mostly due to ignorance) certain characters without realizing that they do not exist in ISO-8859-1, it might be a symptom of some different problem, like malformed data unintentionally appearing in a document. It is thus useful to draw the author’s attention to specific problems, incorrect data where it appears, rather than blindly taking ISO-8859-1 as windows-1252. Yucca
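The error handling Jukka describes — keep the iso-8859-1 label conforming, decode as windows-1252 the way browsers do, but report each byte in 0x80...0x9F so the author sees the specific problem — might look like this sketch (the function name and return shape are invented for illustration, not any real validator's API):

```python
def check_iso_8859_1_labeled(data: bytes):
    """Decode bytes labeled iso-8859-1 as windows-1252 (matching browser
    behavior), while collecting the positions of bytes 0x80..0x9F that a
    validator would report as data errors."""
    offenders = [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]
    # A few cp1252 bytes (e.g. 0x81) are unmapped, hence errors="replace".
    text = data.decode("cp1252", errors="replace")
    return text, offenders

# A windows-1252 euro sign (byte 0x80) sneaks into an "ISO-8859-1" page:
text, offenders = check_iso_8859_1_labeled(b"price: \x80 100")
assert text == "price: \u20ac 100"
assert offenders == [7]  # the validator can point at the exact byte
```

The page still renders as the author intended, while the report points at the exact offending byte rather than rejecting the encoding declaration wholesale.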
Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
On Tue, Jul 2, 2013 at 8:05 AM, Jukka K. Korpela jkorp...@cs.tut.fi wrote: [...] I think a much more interesting problem is when they update that old page with an IRI, form, or some XMLHttpRequest, and shit hits the fan. That's why you want to flag all non-utf-8 usage and just get people to migrate towards sanity. -- http://annevankesteren.nl/
Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
2013-07-02 10:15, Anne van Kesteren wrote: On Tue, Jul 2, 2013 at 8:05 AM, Jukka K. Korpela jkorp...@cs.tut.fi wrote: [...] I think a much more interesting problem is when they update that old page with an IRI, form, or some XMLHttpRequest, and shit hits the fan. That's why you want to flag all non-utf-8 usage and just get people to migrate towards sanity.

Such evangelism is a different issue. If you want to nag “you should use UTF-8” as a warning every time someone declares any other encoding, you will confuse or irritate many people and will reduce the popularity of validators. But in any case, it is quite distinct from the issue of declaring the iso-8859-1 encoding as an error. Yucca
Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
On Tue, Jul 2, 2013 at 9:24 AM, Jukka K. Korpela jkorp...@cs.tut.fi wrote: Such evangelism is a different issue. If you want to nag “you should use UTF-8”, as a warning, each and every time when someones declares any other encoding, you will confuse or irritate many people and will reduce the popularity of validators. But in any case, it is quite distinct from the issue of declaring the iso-8859-1 encoding as an error. Given that the validator is used primarily for new content, I doubt that very much. And most new content is utf-8 already. As for how it's related, http://encoding.spec.whatwg.org/ takes the stance currently that all non-utf-8 encodings are simply non-conforming, as there's too much room for error when using them. -- http://annevankesteren.nl/
Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
On Tue, 26 Mar 2013, Henri Sivonen wrote:

In various places that deal with encoding labels, the HTML spec now requires authors to use the name of the encoding from the Encoding Standard, which means using the preferred name rather than an alias. Compared to the previous reference to the IANA registry, some names that work in all browsers but are no longer preferred names are now errors, such as iso-8859-1 and tis-620. Making broadly-supported names that were previously the preferred names according to IANA into errors does not appear to provide any utility to Web authors who use validators. Please relax the requirement so that at least previously-preferred names are not errors. zcorpan suggested (http://krijnhoetmer.nl/irc-logs/whatwg/20130325#l-920) allowing non-preferred names for non-UTF-8 encodings. I'm not familiar with the level of browser support for all of the non-preferred aliases, but I could accept zcorpan's suggestion.

The reason that ISO-8859-1 is currently non-conforming is that the label no longer means ISO-8859-1, as defined by the ISO. It actually means Windows-1252. Previously, this was also somewhat the case, but it was only an error to use ISO-8859-1 in a manner that was not equivalent across both encodings (there was the concept of "misinterpreted for compatibility"). This was removed with the move to the Encoding spec.

It seems bad, and maybe rather full of hubris, to make it conforming to use a label that we know will be interpreted in a manner that is a willful violation of its spec (that is, the ISO spec). I would rather go back to having the conflicts be caught by validators than just throw the ISO spec under the bus, but it's really up to you (Henri, and whoever else is implementing a validator).

Given the above context, do you still think we should make ISO-8859-1 unconditionally valid? If so, I'll change the various places in the spec that refer to encoding names to also allow any of the encoding labels.
-- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
On Mon, Jul 1, 2013 at 6:16 PM, Ian Hickson i...@hixie.ch wrote: It seems bad, and maybe rather full of hubris, to make it conforming to use a label that we know will be interpreted in a manner that is a willful violation of its spec (that is, the ISO spec).

It's hard enough to get people to label their encodings in the first place. It doesn't seem like a good idea to spend people's limited attention on encodings with "you should change your encoding label", even though what you already have will always work, especially given how widespread the ISO-8859-1 label is.

(FWIW, I wouldn't change a server to say windows-1252. The ISO spec is so far out of touch with reality that it's hard to consider it authoritative; in reality, ISO-8859-1 is 1252.)

-- Glenn Maynard
[whatwg] Requiring the Encoding Standard preferred name is too strict for no good reason
In various places that deal with encoding labels, the HTML spec now requires authors to use the name of the encoding from the Encoding Standard, which means using the preferred name rather than an alias. Compared to the previous reference to the IANA registry, some names that work in all browsers but are no longer preferred names are now errors, such as iso-8859-1 and tis-620.

Making broadly-supported names that were previously the preferred names according to IANA into errors does not appear to provide any utility to Web authors who use validators. Please relax the requirement so that at least previously-preferred names are not errors.

zcorpan suggested (http://krijnhoetmer.nl/irc-logs/whatwg/20130325#l-920) allowing non-preferred names for non-UTF-8 encodings. I'm not familiar with the level of browser support for all of the non-preferred aliases, but I could accept zcorpan's suggestion.

-- Henri Sivonen hsivo...@iki.fi http://hsivonen.iki.fi/
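The label aliasing discussed throughout this thread — many labels resolving to one encoding, matched case-insensitively — can be sketched as a lookup table. This covers only the labels mentioned in the thread, not the Encoding Standard's full table, and the function name is invented for illustration:

```python
# Subset of the Encoding Standard's label table, restricted to labels
# mentioned in this thread. Per the spec, all of these resolve to
# windows-1252 in browsers regardless of what the IANA registry says.
LABEL_TO_ENCODING = {
    "iso-8859-1": "windows-1252",
    "iso8859-1": "windows-1252",
    "iso_8859-1:1987": "windows-1252",
    "iso-ir-100": "windows-1252",
    "latin1": "windows-1252",
    "l1": "windows-1252",
    "cp1252": "windows-1252",
    "x-cp1252": "windows-1252",
    "cp819": "windows-1252",
    "ibm819": "windows-1252",
    "us-ascii": "windows-1252",
    "ascii": "windows-1252",
    "ansi_x3.4-1968": "windows-1252",
    "windows-1252": "windows-1252",
    "utf-8": "utf-8",
}

def get_encoding(label):
    """Resolve a label after lowercasing and stripping whitespace,
    roughly as the spec's 'get an encoding' step does. Returns None
    for labels outside this subset."""
    return LABEL_TO_ENCODING.get(label.strip().lower())

assert get_encoding(" ISO-8859-1 ") == "windows-1252"
```

This many-to-one mapping is exactly why relabeling an ISO-8859-1 page as windows-1252 changes nothing about how browsers process it.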