Re: [whatwg] Default encoding to UTF-8?
On Tue, Apr 3, 2012 at 10:08 PM, Anne van Kesteren <ann...@opera.com> wrote:
>> I didn't mean a prescan. I meant proceeding with the real parse and switching decoders in midstream. This would have the complication of also having to change the encoding the document object reports to JavaScript in some cases.
>
> On IRC (#whatwg) zcorpan pointed out this would break URLs where entities are used to encode non-ASCII code points in the query component.

Good point. So it's not worthwhile to add magic here. It's better that authors declare that they are using UTF-8.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
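[Editor's note: zcorpan's point can be made concrete. When markup like `<a href="?q=&#xe6;">` is parsed, the entity resolves to U+00E6, and the query component is percent-encoded using the document's encoding, so switching decoders midstream would silently change the URL the page navigates to. A small illustration (not from the thread), using Python's standard `urllib.parse.quote`:]

```python
from urllib.parse import quote

# An href such as <a href="?q=&#xe6;"> resolves the entity to U+00E6 ("æ").
# The query component is then percent-encoded using the *document's* encoding,
# so changing the decoder midstream changes the submitted URL.
query_char = '\u00e6'  # æ, written as &#xe6; in the markup

as_windows_1252 = quote(query_char, encoding='windows-1252')
as_utf_8 = quote(query_char, encoding='utf-8')

print(as_windows_1252)  # %E6
print(as_utf_8)         # %C3%A6
```

Two different URLs for byte-identical markup, which is exactly why the mid-stream switch breaks such pages.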
Re: [whatwg] Default encoding to UTF-8?
On Wed, Jan 4, 2012 at 12:34 AM, Leif Halvard Silli <xn--mlform-...@xn--mlform-iua.no> wrote:
>> I mean the performance impact of reloading the page or, alternatively, the loss of incremental rendering.) A solution that would border on reasonable would be decoding as US-ASCII up to the first non-ASCII byte
>
> Thus possibly a prescan of more than 1024 bytes?

I didn't mean a prescan. I meant proceeding with the real parse and switching decoders in midstream. This would have the complication of also having to change the encoding the document object reports to JavaScript in some cases.

>> and then deciding between UTF-8 and the locale-specific legacy encoding by examining the first non-ASCII byte and up to 3 bytes after it to see if they form a valid UTF-8 byte sequence.
>
> Except for the specifics, that sounds like more or less the idea I tried to state. Maybe it could be made into a bug in Mozilla?

It's not clear that this is actually worth implementing or spending time on at this stage.

> However, there is one thing that should be added: The parser should default to UTF-8 even if it does not detect any UTF-8-ish non-ASCII.

That would break form submissions.

>> But trying to gain more statistical confidence about UTF-8ness than that would be bad for performance (either due to stalling stream processing or due to reloading).
>
> So here you say that it is better to start to present early, and eventually reload [I think] if during the presentation the encoding choice shows itself to be wrong, than it would be to investigate too much and be absolutely certain before starting to present the page.

I didn't intend to suggest reloading.

>> Adding autodetection wouldn't actually force authors to use UTF-8, so the problem Faruk stated at the start of the thread (authors not using UTF-8 throughout systems that process user input) wouldn't be solved.
>
> If we take that logic to its end, then it would not make sense for the validator to display an error when a page contains a form without being UTF-8 encoded, either. Because, after all, the backend/whatever could be non-UTF-8 based. The only way to solve that problem on those systems would be to send form content as character entities. (However, then too the form-based page should still be UTF-8 in the first place, in order to be able to take any content.)

Presumably, when an author reacts to an error message, (s)he not only fixes the page but also the back end. When a browser makes encoding guesses, it obviously cannot fix the back end.

[ Original letter continued: ]

>>> Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding detection. So it might still be a competitive advantage.
>>
>> It would be interesting to know what exactly Chrome does. Maybe someone who knows the code could enlighten us?
>
> +1 (But their approach looks similar to the 'border on sane' approach you presented. Except that they seek to detect also non-UTF-8.)

I'm slightly disappointed but not surprised that this thread hasn't gained a message explaining what Chrome does exactly.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
Re: [whatwg] Default encoding to UTF-8?
On Tue, 03 Apr 2012 13:59:25 +0200, Henri Sivonen <hsivo...@iki.fi> wrote:
> On Wed, Jan 4, 2012 at 12:34 AM, Leif Halvard Silli <xn--mlform-...@xn--mlform-iua.no> wrote:
>>> A solution that would border on reasonable would be decoding as US-ASCII up to the first non-ASCII byte
>>
>> Thus possibly a prescan of more than 1024 bytes?
>
> I didn't mean a prescan. I meant proceeding with the real parse and switching decoders in midstream. This would have the complication of also having to change the encoding the document object reports to JavaScript in some cases.

On IRC (#whatwg) zcorpan pointed out this would break URLs where entities are used to encode non-ASCII code points in the query component.

--
Anne van Kesteren
http://annevankesteren.nl/
Re: [whatwg] Default encoding to UTF-8?
On Thu, Dec 22, 2011 at 12:36 PM, Leif Halvard Silli <l...@russisk.no> wrote:
>> It's unclear to me if you are talking about HTTP-level charset=UNICODE or charset=UNICODE in a meta. Is content labeled with charset=UNICODE BOMless?
>
> Charset=UNICODE in meta, as generated by MS tools (Office or IE, e.g.), seems to usually be BOM-full. But there are still enough occurrences of pages without a BOM. I have found UTF-8 pages with the charset=unicode label in meta. But the few pages I found contained either a BOM or HTTP-level charset=utf-8. I have too little research material when it comes to UTF-8 pages with charset=unicode inside.

Making 'unicode' an alias of UTF-16 or UTF-16LE would be useless for pages that have a BOM, because the BOM is already inspected before meta, and if the HTTP-level charset is unrecognized, the BOM wins. Making 'unicode' an alias of UTF-16 or UTF-16LE would be useful for UTF-8-encoded pages that say charset=unicode in meta if alias resolution happens before UTF-16 labels are mapped to UTF-8. Making 'unicode' an alias for UTF-16 or UTF-16LE would be useless for pages that are (BOMless) UTF-16LE and that have charset=unicode in meta, because the meta prescan doesn't see UTF-16-encoded metas.

Furthermore, it doesn't make sense to make the meta prescan look for UTF-16-encoded metas, because it would make sense to honor the value only if it matched a flavor of UTF-16 appropriate for the pattern of zero bytes in the file, so it would be more reliable and straightforward to just analyze the pattern of zero bytes without bothering to look for UTF-16-encoded metas.

> When the detector says UTF-8 - that is step 7 of the sniffing algorithm, no? http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

Yes.

>> 2) Start the parse assuming UTF-8 and reload as Windows-1252 if the detector says non-UTF-8.
> ...
> I think you are mistaken there: If parsers perform UTF-8 detection, then unlabelled pages will be detected, and no reparsing will happen. Not even increase. You at least need to explain this negative-spiral theory better before I buy it ... Step 7 will *not* lead to reparsing unless the default encoding is WINDOWS-1252. If the default encoding is UTF-8, then step 7, when it detects UTF-8, means that parsing can continue uninterrupted.

That would be what I labeled as option #2 above.

> What we will instead see is that those using legacy encodings must be more clever in labelling their pages, or else they won't be detected.

Many pages that use legacy encodings are legacy pages that aren't actively maintained. Unmaintained pages aren't going to become more clever about labeling.

> I am a bit baffled here: It sounds like you say that there will be bad consequences if browsers become more reliable ...

Becoming more reliable can be bad if the reliability comes at the cost of performance, which would be the case if the kind of heuristic detector that e.g. Firefox has was turned on for all locales. (I don't mean the performance impact of running a detector state machine. I mean the performance impact of reloading the page or, alternatively, the loss of incremental rendering.) A solution that would border on reasonable would be decoding as US-ASCII up to the first non-ASCII byte and then deciding between UTF-8 and the locale-specific legacy encoding by examining the first non-ASCII byte and up to 3 bytes after it to see if they form a valid UTF-8 byte sequence. But trying to gain more statistical confidence about UTF-8ness than that would be bad for performance (either due to stalling stream processing or due to reloading).

> Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding detection. So it might still be a competitive advantage.

It would be interesting to know what exactly Chrome does. Maybe someone who knows the code could enlighten us?

> * Let's say that I *kept* ISO-8859-1 as default encoding, but instead enabled the Universal detector. The frame then works.
> * But if I make the frame page very short, 10 * the letter ø as content, then the Universal detector fails - in a test on my own computer, it guesses the page to be Cyrillic rather than Norwegian.
> * What's the problem? The Universal detector is too greedy - it tries to fix more problems than I have. I only want it to guess on UTF-8. And if it doesn't detect UTF-8, then it should fall back to the locale default (including falling back to the encoding of the parent frame). Wouldn't that be an idea?

No. The current configuration works for Norwegian users already. For users from different silos, the ad might break, but ad breakage is less bad than spreading heuristic detection to more locales.

> Here I must disagree: Less bad for whom?

For users, performance-wise.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
Re: [whatwg] Default encoding to UTF-8?
On Tue, Jan 3, 2012 at 10:33 AM, Henri Sivonen <hsivo...@iki.fi> wrote:
> A solution that would border on reasonable would be decoding as US-ASCII up to the first non-ASCII byte and then deciding between UTF-8 and the locale-specific legacy encoding by examining the first non-ASCII byte and up to 3 bytes after it to see if they form a valid UTF-8 byte sequence. But trying to gain more statistical confidence about UTF-8ness than that would be bad for performance (either due to stalling stream processing or due to reloading).

And it's worth noting that the above paragraph states a solution to the problem that is: How to make it possible to use UTF-8 without declaring it? Adding autodetection wouldn't actually force authors to use UTF-8, so the problem Faruk stated at the start of the thread (authors not using UTF-8 throughout systems that process user input) wouldn't be solved.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
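[Editor's note: the "border on reasonable" heuristic quoted above can be sketched in a few lines. The function name and return convention are illustrative assumptions; real implementations would work incrementally on a byte stream rather than on a complete buffer:]

```python
def utf8_verdict_at_first_non_ascii(data: bytes):
    """Skip the ASCII prefix, then judge UTF-8-ness from the first
    non-ASCII byte plus at most 3 bytes after it.

    Returns True (looks like UTF-8), False (not UTF-8; use the
    locale-specific legacy encoding), or None (all ASCII so far,
    undecided - both decodings agree on ASCII anyway).
    """
    i = next((k for k, b in enumerate(data) if b >= 0x80), None)
    if i is None:
        return None
    lead = data[i]
    if 0xC2 <= lead <= 0xDF:
        need = 1            # 2-byte sequence
    elif 0xE0 <= lead <= 0xEF:
        need = 2            # 3-byte sequence
    elif 0xF0 <= lead <= 0xF4:
        need = 3            # 4-byte sequence
    else:
        return False        # stray continuation byte or invalid lead
    tail = data[i + 1:i + 1 + need]
    return len(tail) == need and all(0x80 <= b <= 0xBF for b in tail)

print(utf8_verdict_at_first_non_ascii('<title>æøå</title>'.encode('utf-8')))         # True
print(utf8_verdict_at_first_non_ascii('<title>æøå</title>'.encode('windows-1252')))  # False
print(utf8_verdict_at_first_non_ascii(b'<title>ascii</title>'))                      # None
```

Because the decision is taken at the first non-ASCII byte, there is no stalling and no reload - which is exactly the performance property the paragraph argues for. (A legacy-encoded byte run can in rare cases also form valid UTF-8, so this is a heuristic, not a proof.)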
Re: [whatwg] Default encoding to UTF-8?
Henri Sivonen, Tue Jan 3 00:33:02 PST 2012:
> On Thu, Dec 22, 2011 at 12:36 PM, Leif Halvard Silli wrote:
> Making 'unicode' an alias of UTF-16 or UTF-16LE would be useful for UTF-8-encoded pages that say charset=unicode in meta if alias resolution happens before UTF-16 labels are mapped to UTF-8.

Yup.

> Making 'unicode' an alias for UTF-16 or UTF-16LE would be useless for pages that are (BOMless) UTF-16LE and that have charset=unicode in meta, because the meta prescan doesn't see UTF-16-encoded metas.

Hm. Yes. I see that I misread something, and ended up believing that the meta would *still* be used if the mapping from 'UTF-16' to 'UTF-8' turned out to be incorrect. I guess I had not understood, well enough, that the meta prescan *really* doesn't see UTF-16-encoded metas. Also contributing was the fact that I did not realize that IE doesn't actually read the page as UTF-16 but as Windows-1252: http://www.hughesrenier.be/actualites.html. (Actually, browsers do see the UTF-16 meta, but only if the default encoding is set to be UTF-16 - see step 1 of '8.2.2.4 Changing the encoding while parsing', http://dev.w3.org/html5/spec/parsing.html#change-the-encoding.)

> Furthermore, it doesn't make sense to make the meta prescan look for UTF-16-encoded metas, because it would make sense to honor the value only if it matched a flavor of UTF-16 appropriate for the pattern of zero bytes in the file, so it would be more reliable and straightforward to just analyze the pattern of zero bytes without bothering to look for UTF-16-encoded metas.

Makes sense.

[ snip ]

>> What we will instead see is that those using legacy encodings must be more clever in labelling their pages, or else they won't be detected.
>
> Many pages that use legacy encodings are legacy pages that aren't actively maintained. Unmaintained pages aren't going to become more clever about labeling.

But their non-UTF-8-ness should be picked up in the first 1024 bytes?

[... sniff - sorry, meant snip ;-) ...]

> I mean the performance impact of reloading the page or, alternatively, the loss of incremental rendering.) A solution that would border on reasonable would be decoding as US-ASCII up to the first non-ASCII byte

Thus possibly a prescan of more than 1024 bytes? Is it faster to scan ASCII? (In Chrome, there does not seem to be an end to the prescan, as long as the text source code is ASCII only.)

> and then deciding between UTF-8 and the locale-specific legacy encoding by examining the first non-ASCII byte and up to 3 bytes after it to see if they form a valid UTF-8 byte sequence.

Except for the specifics, that sounds like more or less the idea I tried to state. Maybe it could be made into a bug in Mozilla? (I could do it, but ...)

However, there is one thing that should be added: The parser should default to UTF-8 even if it does not detect any UTF-8-ish non-ASCII. Is that part of your idea? Because, if it does not behave like that, then it would work as Google Chrome now works. Which for the following UTF-8-encoded (but charset-unlabelled) page means that it defaults to UTF-8:

<!DOCTYPE html><title>æøå</title></html>

While for this - identical - page, it would default to the locale encoding, due to the use of ASCII-based character entities, which causes it not to detect any UTF-8-ish characters:

<!DOCTYPE html><title>&#xe6;&#xf8;&#xe5;</title></html>

A weird variant of the latter example is UTF-8-based data URIs, where all browsers (that I could test - IE only supports data URIs in the @src attribute, including script@src) default to the locale encoding (apart from Mozilla Camino - which has character detection enabled by default):

data:text/html,<!DOCTYPE html><title>%C3%A6%C3%B8%C3%A5</title></html>

All the 3 examples above should default to UTF-8, if the 'border on sane' approach was applied.

> But trying to gain more statistical confidence about UTF-8ness than that would be bad for performance (either due to stalling stream processing or due to reloading).

So here you say that it is better to start to present early, and eventually reload [I think] if during the presentation the encoding choice shows itself to be wrong, than it would be to investigate too much and be absolutely certain before starting to present the page.

Later, at Jan 3 00:50:26 PST 2012, you added:

> And it's worth noting that the above paragraph states a solution to the problem that is: How to make it possible to use UTF-8 without declaring it?

Indeed.

> Adding autodetection wouldn't actually force authors to use UTF-8, so the problem Faruk stated at the start of the thread (authors not using UTF-8 throughout systems that process user input) wouldn't be solved.

If we take that logic to its end, then it would not make sense for the validator to display an error when a page contains a form without being UTF-8 encoded, either. Because, after all, the backend/whatever could be non-UTF-8 based. The only way to solve that
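[Editor's note: Leif's pair of 'identical' title pages in the message above can be checked mechanically. The first carries UTF-8 bytes; the second spells the same letters as ASCII character references, so a byte-level UTF-8 sniffer has literally nothing to inspect:]

```python
# The two 'identical' example pages: same rendered title, different bytes.
raw = '<!DOCTYPE html><title>æøå</title>'.encode('utf-8')
ncr = '<!DOCTYPE html><title>&#xe6;&#xf8;&#xe5;</title>'.encode('ascii')

# The UTF-8 version contains non-ASCII bytes a detector can examine;
# the entity version is pure ASCII, so byte-level detection stays undecided
# and the browser falls back to the locale default.
print(any(b >= 0x80 for b in raw))  # True
print(any(b >= 0x80 for b in ncr))  # False
```

This is the mechanical reason Leif argues the parser should default to UTF-8 even when no UTF-8-ish bytes are detected: for the entity-only page, detection can never fire.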
Re: [whatwg] Default encoding to UTF-8?
Henri Sivonen, Mon Dec 19 07:17:43 PST 2011:
> On Sun, Dec 11, 2011 at 1:21 PM, Leif Halvard Silli wrote:
> Sorry for my slow reply.
>
>> It surprises me greatly that Gecko doesn't treat unicode as an alias for utf-16. Which must EITHER mean that many of these pages *are* UTF-16 encoded OR that their content is predominantly US-ASCII and thus the artefacts of parsing UTF-8 pages (UTF-16 should be treated as UTF-8 when it isn't UTF-16) as WINDOWS-1252 do not affect users too much.
>
> It's unclear to me if you are talking about HTTP-level charset=UNICODE or charset=UNICODE in a meta. Is content labeled with charset=UNICODE BOMless?

Charset=UNICODE in meta, as generated by MS tools (Office or IE, e.g.), seems to usually be BOM-full. But there are still enough occurrences of pages without a BOM. I have found UTF-8 pages with the charset=unicode label in meta. But the few pages I found contained either a BOM or HTTP-level charset=utf-8. I have too little research material when it comes to UTF-8 pages with charset=unicode inside.

>> (2) for the user tests you suggested in Mozilla bug 708995 (above), the presence of meta charset=UNICODE would trigger a need for Firefox users to select UTF-8 - unless the locale already defaults to UTF-8;
>
> Hmm. The HTML spec isn't too clear about when alias resolution happens, so I (incorrectly, I now think) mapped only UTF-16, UTF-16BE and UTF-16LE (ASCII-case-insensitive) to UTF-8 in meta without considering aliases at that point. Hixie, was alias resolution supposed to happen first? In Firefox, alias resolution happens after, so meta charset=iso-10646-ucs-2 is ignored per the non-ASCII superset rule.

Waiting to hear what Hixie says ...

> While UTF-8 is possible to detect, I really don't want to take Firefox down the road where users who currently don't have to suffer page load restarts from heuristic detection have to start suffering them. (I think making incremental rendering any less incremental for locales that currently don't use a detector is not an acceptable solution for avoiding restarts. With English-language pages, the UTF-8ness might not be apparent from the first 1024 bytes.)

FIRSTLY, HTML5:

]] 8.2.2.4 Changing the encoding while parsing [...] This might happen if the encoding sniffing algorithm described above failed to find an encoding, or if it found an encoding that was not the actual encoding of the file. [[

Thus, trying to detect UTF-8 is the second-last step of the sniffing algorithm. If it, correctly, detects UTF-8, then, while the detection probably affects performance, detecting UTF-8 should not lead to a need for re-parsing the page?

> Let's consider, for simplicity, the locales for Western Europe and the Americas that default to Windows-1252 today. If browsers in these locales started doing UTF-8-only detection, they could either:
>
> 1) Start the parse assuming Windows-1252 and reload if the detector says UTF-8.

When the detector says UTF-8 - that is step 7 of the sniffing algorithm, no? http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

> 2) Start the parse assuming UTF-8 and reload as Windows-1252 if the detector says non-UTF-8. (Buffering the whole page is not an option, since it would break incremental loading.)
>
> Option #1 would be bad, because we'd see more and more reloading over time assuming that authors start using more and more UTF-8-enabled tools over time but don't go through the trouble of declaring UTF-8, since the pages already seem to work without declarations.

So the so-called badness is only a theory about what will happen - how the web will develop. As is, there is nothing particularly bad about starting out with UTF-8 as the assumption. I think you are mistaken there: If parsers perform UTF-8 detection, then unlabelled pages will be detected, and no reparsing will happen. Not even increase. You at least need to explain this negative-spiral theory better before I buy it ... Step 7 will *not* lead to reparsing unless the default encoding is WINDOWS-1252. If the default encoding is UTF-8, then step 7, when it detects UTF-8, means that parsing can continue uninterrupted. What we will instead see is that those using legacy encodings must be more clever in labelling their pages, or else they won't be detected. I am a bit baffled here: It sounds like you say that there will be bad consequences if browsers become more reliable ...

> Option #2 would be bad, because pages that didn't reload before would start reloading and possibly executing JS side effects twice.

#1 sounds least bad, since the only badness you describe is a theory about what this behaviour would lead to, w.r.t. authors.

SECONDLY: If there is a UTF-8 silo - that leads to undeclared UTF-8 pages, then it is the browsers *outside* that silo which eventually suffer (browsers that do default to UTF-8 do not need to perform UTF-8 detection, I suppose - or what?). So then it is
Re: [whatwg] Default encoding to UTF-8?
On Sun, Dec 11, 2011 at 1:21 PM, Leif Halvard Silli <xn--mlform-...@xn--mlform-iua.no> wrote:
>>>> (which means *other-language* pages when the language of the localization doesn't have a pre-UTF-8 legacy).
>>>
>>> Do you have any concrete examples?
>>
>> The example I had in mind was Welsh.
>
> Logical candidate. What do you know about the Farsi and Arabic locales?

Nothing, basically.

> I discovered that UNICODE is used as alias for UTF-16 in IE and Webkit. ... Seemingly, this has not affected Firefox users too much. It surprises me greatly that Gecko doesn't treat unicode as an alias for utf-16. Which must EITHER mean that many of these pages *are* UTF-16 encoded OR that their content is predominantly US-ASCII and thus the artefacts of parsing UTF-8 pages (UTF-16 should be treated as UTF-8 when it isn't UTF-16) as WINDOWS-1252 do not affect users too much.

It's unclear to me if you are talking about HTTP-level charset=UNICODE or charset=UNICODE in a meta. Is content labeled with charset=UNICODE BOMless?

> (2) for the user tests you suggested in Mozilla bug 708995 (above), the presence of meta charset=UNICODE would trigger a need for Firefox users to select UTF-8 - unless the locale already defaults to UTF-8;

Hmm. The HTML spec isn't too clear about when alias resolution happens, so I (incorrectly, I now think) mapped only UTF-16, UTF-16BE and UTF-16LE (ASCII-case-insensitive) to UTF-8 in meta without considering aliases at that point. Hixie, was alias resolution supposed to happen first? In Firefox, alias resolution happens after, so meta charset=iso-10646-ucs-2 is ignored per the non-ASCII superset rule.

>> While UTF-8 is possible to detect, I really don't want to take Firefox down the road where users who currently don't have to suffer page load restarts from heuristic detection have to start suffering them. (I think making incremental rendering any less incremental for locales that currently don't use a detector is not an acceptable solution for avoiding restarts. With English-language pages, the UTF-8ness might not be apparent from the first 1024 bytes.)
>
> FIRSTLY, HTML5:
>
> ]] 8.2.2.4 Changing the encoding while parsing [...] This might happen if the encoding sniffing algorithm described above failed to find an encoding, or if it found an encoding that was not the actual encoding of the file. [[
>
> Thus, trying to detect UTF-8 is the second-last step of the sniffing algorithm. If it, correctly, detects UTF-8, then, while the detection probably affects performance, detecting UTF-8 should not lead to a need for re-parsing the page?

Let's consider, for simplicity, the locales for Western Europe and the Americas that default to Windows-1252 today. If browsers in these locales started doing UTF-8-only detection, they could either:

1) Start the parse assuming Windows-1252 and reload if the detector says UTF-8.

2) Start the parse assuming UTF-8 and reload as Windows-1252 if the detector says non-UTF-8.

(Buffering the whole page is not an option, since it would break incremental loading.)

Option #1 would be bad, because we'd see more and more reloading over time assuming that authors start using more and more UTF-8-enabled tools over time but don't go through the trouble of declaring UTF-8, since the pages already seem to work without declarations. Option #2 would be bad, because pages that didn't reload before would start reloading and possibly executing JS side effects twice.

> SECONDLY: If there is a UTF-8 silo - that leads to undeclared UTF-8 pages, then it is the browsers *outside* that silo which eventually suffer (browsers that do default to UTF-8 do not need to perform UTF-8 detection, I suppose - or what?). So then it is partly a matter of how large the silo is. Regardless, we must consider: The alternative to undeclared UTF-8 pages would be undeclared legacy-encoding pages, roughly speaking. Which the browsers outside the silo then would have to detect. And which would be more *demanding* to detect than simply detecting UTF-8.

Well, so far (except for sv-SE (but no longer) and zh-TW), Firefox has not *by default* done cross-silo detection and has managed to get non-trivial market share, so it's not a given that browsers from outside a legacy silo *have to* detect.

> However, what you had in mind was the change of the default encoding for a particular silo from legacy encoding to UTF-8. This, I agree, would lead to some pages being treated as UTF-8 - to begin with. But when the browser detects that this is wrong, it would have to switch to - probably - the old default - the legacy encoding. However, why would it switch *from* UTF-8 if UTF-8 is the default? We must keep the problem in mind: For the siloed browser, UTF-8 will be its fall-back encoding.

Doesn't the first of these two paragraphs answer the question posed in the second one?

>> It's rather counterintuitive that the persistent autodetection setting is in the same menu as the one-off override.
>
> You talk about
Re: [whatwg] Default encoding to UTF-8?
Leif Halvard Silli, Sun Dec 11 03:21:40 PST 2011:

W.r.t. iframe, then the big-in-Norway newspaper Dagbladet.no is declared ISO-8859-1 encoded, and it includes at least one ads-iframe that ...

* Let's say that I *kept* ISO-8859-1 as default encoding, but instead enabled the Universal detector. The frame then works.
* But if I make the frame page very short, 10 * the letter ø as content, then the Universal detector fails - in a test on my own computer, it guesses the page to be Cyrillic rather than Norwegian.
* What's the problem? The Universal detector is too greedy - it tries to fix more problems than I have. I only want it to guess on UTF-8. And if it doesn't detect UTF-8, then it should fall back to the locale default (including falling back to the encoding of the parent frame). Wouldn't that be an idea?

The above illustrates that the current charset-detection solutions are starting to get old: They are not geared and optimized towards UTF-8 as the firmly recommended and - in principle - anticipated default.

The above may also catch a real problem with switching to UTF-8: that one may need to embed pages which do not use UTF-8. If one could trust UAs to attempt UTF-8 detection (but not Universal detection) before defaulting, then it would become virtually risk-free to switch a page to UTF-8, even if it contains iframe pages. Not?

Leif H Silli
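[Editor's note: Leif's "UTF-8-only detection with locale fallback" proposal fits in a few lines. The function name is illustrative, and a real detector would work incrementally on a stream rather than on a whole buffer; the point is only that strict UTF-8 validation plus a fixed fallback never needs a language-guessing model:]

```python
def sniff(frame_bytes: bytes, locale_fallback: str) -> str:
    """Try strict UTF-8; otherwise fall back to the locale (or
    parent-frame) default instead of guessing among languages."""
    try:
        frame_bytes.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return locale_fallback

# The short Norwegian ad frame: 10 copies of ø in a legacy encoding.
ad_frame = ('ø' * 10).encode('iso-8859-1')
print(sniff(ad_frame, 'windows-1252'))                      # windows-1252
print(sniff('øl og vin'.encode('utf-8'), 'windows-1252'))   # utf-8
```

Unlike the Universal detector in Leif's test, this scheme cannot misfile the frame as Cyrillic: the only possible answers are UTF-8 or the locale default. (Legacy bytes that happen to form valid UTF-8 would still be misdetected, which is the residual risk of any such heuristic.)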
Re: [whatwg] Default encoding to UTF-8?
On Fri, Dec 9, 2011 at 12:33 AM, Leif Halvard Silli <xn--mlform-...@xn--mlform-iua.no> wrote:
> Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
> These localizations are nevertheless live tests. If we want to move more firmly in the direction of UTF-8, one could ask users of those 'live tests' about their experience.

Filed https://bugzilla.mozilla.org/show_bug.cgi?id=708995

>> (which means *other-language* pages when the language of the localization doesn't have a pre-UTF-8 legacy).
>
> Do you have any concrete examples?

The example I had in mind was Welsh.

> And are there user complaints?

Not that I know of, but I'm not part of a feedback loop, if there even is a feedback loop here.

> The Serb localization uses UTF-8. The Croat uses Win-1252, but only on Windows and Mac: On Linux it appears to use UTF-8, if I read the HG repository correctly.

OS-dependent differences are *very* suspicious. :-(

>> I think that defaulting to UTF-8 is always a bug, because at the time these localizations were launched, there should have been no unlabeled UTF-8 legacy, because up until these locales were launched, no browsers defaulted to UTF-8 (broadly speaking). I think defaulting to UTF-8 is harmful, because it makes it possible for locale-siloed unlabeled UTF-8 content to come to existence
>
> The current legacy encodings nevertheless create siloed pages already. I'm also not sure that it would be a problem with such a UTF-8 silo: UTF-8 is possible to detect, for browsers - Chrome seems to perform more such detection than other browsers.

While UTF-8 is possible to detect, I really don't want to take Firefox down the road where users who currently don't have to suffer page load restarts from heuristic detection have to start suffering them. (I think making incremental rendering any less incremental for locales that currently don't use a detector is not an acceptable solution for avoiding restarts. With English-language pages, the UTF-8ness might not be apparent from the first 1024 bytes.)

> In another message you suggested I 'lobby' against authoring tools. OK. But the browser is also an authoring tool.

In what sense?

> So how can we have authors output UTF-8, by default, without changing the parsing default?

Changing the default is an XML-like solution: creating breakage for users (who view legacy pages) in order to change author behavior. To the extent a browser is a tool Web authors use to test stuff, it's possible to add various whining to the console without breaking legacy sites for users. See https://bugzilla.mozilla.org/show_bug.cgi?id=672453 and https://bugzilla.mozilla.org/show_bug.cgi?id=708620

> Btw: In Firefox, then in one sense, it is impossible to disable automatic character detection: In Firefox, overriding of the encoding only lasts until the next reload.

A persistent setting for changing the fallback default is in the Advanced subdialog of the font prefs in the Content preference pane. It's rather counterintuitive that the persistent autodetection setting is in the same menu as the one-off override.

As for heuristic detection based on the bytes of the page, the only heuristic that can't be disabled is the heuristic for detecting BOMless UTF-16 that encodes Basic Latin only. (Some Indian bank was believed to have been giving that sort of files to their customers, and it worked in pre-HTML5 browsers that silently discarded all zero bytes prior to tokenization.) The Cyrillic and CJK detection heuristics can be turned on and off by the user.

Within an origin, Firefox considers the parent frame and the previous document in the navigation history as sources of encoding guesses. That behavior is not user-configurable to my knowledge. Firefox also remembers the encoding from previous visits as long as Firefox otherwise has the page in cache. So for testing, it's necessary to make Firefox forget about previous visits to the test case.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
Re: [whatwg] Default encoding to UTF-8?
Henri Sivonen, Tue Dec 6 23:45:11 PST 2011:
> On Mon, Dec 5, 2011 at 7:42 PM, Leif Halvard Silli wrote:
> Mozilla grants localizers a lot of latitude here. The defaults you see are not carefully chosen by a committee of encoding strategists doing whole-Web optimization at Mozilla.

We could use such a committee for the Web!

> They are chosen by individual localizers. Looking at which locales default to UTF-8, I think the most probable explanation is that the localizers mistakenly tried to pick an encoding that fits the language of the localization instead of picking an encoding that's the most successful at decoding unlabeled pages most likely read by users of the localization

These localizations are nevertheless live tests. If we want to move more firmly in the direction of UTF-8, one could ask users of those 'live tests' about their experience.

> (which means *other-language* pages when the language of the localization doesn't have a pre-UTF-8 legacy).

Do you have any concrete examples? And are there user complaints?

The Serb localization uses UTF-8. The Croat uses Win-1252, but only on Windows and Mac: On Linux it appears to use UTF-8, if I read the HG repository correctly. As for Croat and Windows-1252, then it does not even support the Croat alphabet (in full) - I think of the digraphs. But I'm not sure about the pre-UTF-8 legacy for Croatian.

Some language communities in Russia have a similar minority situation as Serb Cyrillic, only that their minority script is Latin: They use Cyrillic, but they may also use Latin. But in Russia, Cyrillic dominates. Hence it seems to be the case - according to my earlier findings - that those few letters that, per each language, do not occur in Windows-1251 are inserted as NCRs (that is: when UTF-8 is not used). That way, Win-1251 can be used for Latin with non-ASCII inside. But given that Croat defaults to Win-1252, they could in theory just use NCRs too ...
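[Editor's note: the NCR workaround Leif describes - keeping a legacy code page and escaping only the letters it lacks - can be reproduced directly with Python's `xmlcharrefreplace` error handler. The sample phrase is an illustrative assumption; Č (U+010C) and š (U+0161) are Latin letters absent from the Cyrillic Windows-1251 code page:]

```python
# Latin letters missing from Windows-1251 get emitted as numeric character
# references, while everything the code page covers stays as raw bytes.
text = 'Čuješ li?'
encoded = text.encode('windows-1251', errors='xmlcharrefreplace')
print(encoded)  # b'&#268;uje&#353; li?'
```

The resulting page is still labelled (and decoded) as Windows-1251, yet renders the full alphabet - exactly the pattern found in the pages Leif surveyed.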
Btw, for Safari on Mac, I'm unable to see any effect of switching locale: always Win-1252 (Latin) - it used to have an effect before ... But maybe there is a parameter I'm unaware of - like Apple's knowledge of where in the world I live ... I think that defaulting to UTF-8 is always a bug, because at the time these localizations were launched, there should have been no unlabeled UTF-8 legacy, because up until these locales were launched, no browsers defaulted to UTF-8 (broadly speaking). I think defaulting to UTF-8 is harmful, because it makes it possible for locale-siloed unlabeled UTF-8 content to come into existence. The current legacy encodings nevertheless create siloed pages already. I'm also not sure that such a UTF-8 silo would be a problem: UTF-8 is possible for browsers to detect - Chrome seems to perform more such detection than other browsers. Today, perhaps especially for English users, it happens all the time that a page - without notice - falls back to the default encoding, and this causes the browser - when used as an authoring tool - to default to Windows-1252: http://twitter.com/#!/komputist/status/144834229610614784 (I suppose he used that browser-based spec authoring tool that is in development.) In another message you suggested I 'lobby' against authoring tools. OK. But the browser is also an authoring tool. So how can we have authors output UTF-8, by default, without changing the parsing default? (Instead of guiding all Web authors always to declare their use of UTF-8 so that the content works with all browser locale configurations.) One must guide authors to do this regardless. I have tried to lobby internally at Mozilla for stricter localizer oversight here but have failed. (I'm particularly worried about localizers turning the heuristic detector on by default for their locale when it's not absolutely needed, because that's actually performance-sensitive and less likely to be corrected by the user. 
Therefore, turning the heuristic detector on may do performance reputation damage.) W.r.t. the heuristic detector: testing the default encoding behaviour of Firefox was difficult. But in the end I understood that I must delete the cached version of the Profile folder - only then would the encodings 'fall back' properly. But before I came that far, I tried with, e.g., the Russian version of Firefox, and discovered that it enabled the encoding heuristics: thus it worked! Had it not done that, then it would instead have used Windows-1252 as the default ... So you perhaps need to be careful before telling them to disable heuristics ... Btw: in Firefox, in one sense it is impossible to disable automatic character detection: in Firefox, overriding of the encoding only lasts until the next reload. However, I just discovered that in Opera this is not the case: if you select Windows-1252 in Opera, then it will always - but only in the current tab - be Windows-1252, even if there is a BOM and everything. In a way
Re: [whatwg] Default encoding to UTF-8?
On Tue, Dec 6, 2011 at 2:10 AM, Kornel Lesiński kor...@geekhood.net wrote: On Fri, 02 Dec 2011 15:50:31 -, Henri Sivonen hsivo...@iki.fi wrote: That compatibility mode already exists: It's the default mode--just like the quirks mode is the default for pages that don't have a doctype. You opt out of the quirks mode by saying !DOCTYPE html. You opt out of the encoding compatibility mode by saying meta charset=utf-8. Could !DOCTYPE html be an opt-in to default UTF-8 encoding? It would be nice to minimize number of declarations a page needs to include. I think that's a bad idea. We already have *three* backwards-compatible ways to opt into UTF-8. !DOCTYPE html isn't one of them. Moreover, I think it's a mistake to bundle a lot of unrelated things into one mode switch instead of having legacy-compatible defaults and having granular ways to opt into legacy-incompatible behaviors. (That is, I think, in retrospect, it's bad that we have a doctype-triggered standards mode with legacy-incompatible CSS defaults instead of having legacy-compatible CSS defaults and CSS properties for opting into different behaviors.) If you want to minimize the declarations, you can put the UTF-8 BOM followed by !DOCTYPE html at the start of the file. -- Henri Sivonen hsivo...@iki.fi http://hsivonen.iki.fi/
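Henri's last suggestion (UTF-8 BOM immediately followed by the doctype) is easy to verify mechanically. A small sketch using Python's `utf-8-sig` codec, which prepends the three BOM octets EF BB BF; the document content is just an example:

```python
doc = "<!DOCTYPE html>\n<title>d\u00e9mo</title>\n"

# 'utf-8-sig' is UTF-8 plus a leading byte order mark (EF BB BF).
data = doc.encode("utf-8-sig")
assert data[:3] == b"\xef\xbb\xbf"        # the UTF-8 BOM
assert data[3:18] == b"<!DOCTYPE html>"   # doctype follows immediately

# Decoding with 'utf-8-sig' strips the BOM again:
assert data.decode("utf-8-sig") == doc
```

A file starting with those bytes both declares UTF-8 (via the BOM) and opts into standards mode (via the doctype) without any `meta` element.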
Re: [whatwg] Default encoding to UTF-8?
2011-12-06 6:54, Leif Halvard Silli wrote: Yeah, it would be a pity if it had already become a widespread cargo cult to - all at once - use the HTML5 doctype without using UTF-8 *and* without using some encoding declaration *and* thus effectively relying on the default locale encoding ... Who does have a data corpus? I think we would need to ask search engine developers about that, but what is this proposed change to defaults supposed to achieve? It would break any old page that does not specify the encoding, as soon as the doctype is changed to !doctype html or this doctype is added to a page that lacked a doctype. Since !doctype html is the simplest way to put browsers into standards mode, this would punish authors who have realized that their page works better in standards mode but are unaware of a completely different and fairly complex problem. (Basic character encoding issues are of course not that complex to you and me or most people around here; but most authors are more or less confused by them, and I don't think we should add to the confusion.) There's little point in changing the specs to say something very different from what previous HTML specs have said and from actual browser behavior. If the purpose is to make things more exactly defined (a fixed encoding vs. implementation-defined), then I think such exactness is a luxury we cannot afford. Things would be all different if we were designing a document format from scratch, with no existing implementations and no existing usage. If the purpose is UTF-8 evangelism, then it would be just the kind of evangelism that produces angry people, not converts. If there's something that should be added to or modified in the algorithm for determining character encoding, then I'd say it's error processing. I mean user agent behavior when it detects, after running the algorithm, while processing the document data, that there is a mismatch between them. 
That is, that the data contains octets or octet sequences that are not allowed in the encoding or that denote noncharacters. Such errors are naturally detected when the user agent processes the octets; the question is what the browser should do then. When data that is actually in ISO-8859-1 or some similar encoding has been mislabeled as UTF-8 encoded, then, if the data contains octets outside ASCII, character-level errors are likely to occur. Many ISO-8859-1 octets are just not possible in UTF-8 data. The converse error may also cause character-level errors. And these are not uncommon situations - they seem to occur increasingly often, partly due to cargo-cult use of UTF-8 (when it means declaring UTF-8 but not actually using it, or vice versa), partly due to increased use of UTF-8 combined with ISO-8859-1 encoded data creeping in from somewhere into UTF-8 encoded data. From the user's point of view, the character-level errors currently result in some gibberish (e.g., some odd box appearing instead of a character, in one place) or in a total mess (e.g., a large number of non-ASCII characters displayed all wrong). In either case, I think an error should be signalled to the user, together with a) automatically trying another encoding, such as the locale default encoding instead of UTF-8 or UTF-8 instead of anything else, b) suggesting to the user that he should try to view the page using some other encoding, possibly with a menu of encodings offered as part of the error explanation, or c) a combination of the above. Although there are good reasons why browsers usually don't give error messages, this would be a special case. It's about the primary interpretation of the data in the document and about a situation where some data has no interpretation in the assumed encoding - but usually has an interpretation in some other encoding. 
The current Character encoding overrides rules are questionable because they often mask out data errors that would have helped to detect problems that can be solved constructively. For example, if data labeled as ISO-8859-1 contains an octet in the 80...9F range, then it may well be the case that the data is actually windows-1252 encoded and the override helps everyone. But it may also be the case that the data is in a different encoding and that the override therefore results in gibberish shown to the user, with no hint of the cause of the problem. It would therefore be better to signal a problem to the user, display the page using the windows-1252 encoding but with some instruction or hint on changing the encoding. And a browser should in this process really analyze whether the data can be windows-1252 encoded data that contains only characters permitted in HTML. Yucca
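The mismatch Jukka describes is easy to demonstrate: most ISO-8859-1/windows-1252 bytes above 0x7F cannot occur where UTF-8 expects a continuation byte, so a strict decode fails fast. A minimal sketch of his alternative a) - the function name and the choice of windows-1252 as the fallback are my own, not from any spec:

```python
def decode_with_fallback(data: bytes, declared: str = "utf-8") -> tuple[str, str]:
    """Try the declared encoding strictly; if the bytes are invalid in
    it, fall back to windows-1252 instead of silently showing gibberish."""
    try:
        return data.decode(declared), declared
    except UnicodeDecodeError:
        return data.decode("windows-1252", errors="replace"), "windows-1252"

# 'café' saved as ISO-8859-1 ends in the lone byte 0xE9, which can
# never be a complete sequence in UTF-8:
text, used = decode_with_fallback(b"caf\xe9")
print(used, text)  # windows-1252 café
```

A real browser would additionally have to signal the switch to the user, as proposed above, rather than silently picking a winner.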
Re: [whatwg] Default encoding to UTF-8?
(2011/12/06 17:39), Jukka K. Korpela wrote: 2011-12-06 6:54, Leif Halvard Silli wrote: Yeah, it would be a pity if it had already become a widespread cargo cult to - all at once - use the HTML5 doctype without using UTF-8 *and* without using some encoding declaration *and* thus effectively relying on the default locale encoding ... Who does have a data corpus? I found it: http://rink77.web.fc2.com/html/metatagu.html It uses the HTML5 doctype, does not declare an encoding, and its encoding is Shift_JIS, the default encoding of the Japanese locale. Since !doctype html is the simplest way to put browsers into standards mode, this would punish authors who have realized that their page works better in standards mode but are unaware of a completely different and fairly complex problem. (Basic character encoding issues are of course not that complex to you and me or most people around here; but most authors are more or less confused by them, and I don't think we should add to the confusion.) I don't think there is a page that works better in standards mode than in the *current* loose mode. There's little point in changing the specs to say something very different from what previous HTML specs have said and from actual browser behavior. If the purpose is to make things more exactly defined (a fixed encoding vs. implementation-defined), then I think such exactness is a luxury we cannot afford. Things would be all different if we were designing a document format from scratch, with no existing implementations and no existing usage. If the purpose is UTF-8 evangelism, then it would be just the kind of evangelism that produces angry people, not converts. Agreed; if we were designing a new spec, there would be no reason to choose anything other than UTF-8. But HTML has a long history and much content. We already have HTML*5* pages which don't have an encoding declaration. If there's something that should be added to or modified in the algorithm for determining character encoding, then I'd say it's error processing. 
I mean user agent behavior when it detects, after running the algorithm, while processing the document data, that there is a mismatch between them. That is, that the data contains octets or octet sequences that are not allowed in the encoding or that denote noncharacters. Such errors are naturally detected when the user agent processes the octets; the question is what the browser should do then. Current implementations replace such an invalid octet with a replacement character. Or some implementations scan almost the whole page and use an encoding with which all octets in the page are valid. When data that is actually in ISO-8859-1 or some similar encoding has been mislabeled as UTF-8 encoded, then, if the data contains octets outside ASCII, character-level errors are likely to occur. Many ISO-8859-1 octets are just not possible in UTF-8 data. The converse error may also cause character-level errors. And these are not uncommon situations - they seem to occur increasingly often, partly due to cargo-cult use of UTF-8 (when it means declaring UTF-8 but not actually using it, or vice versa), partly due to increased use of UTF-8 combined with ISO-8859-1 encoded data creeping in from somewhere into UTF-8 encoded data. In such a case, the page should fail to display in the author's environment. From the user's point of view, the character-level errors currently result in some gibberish (e.g., some odd box appearing instead of a character, in one place) or in a total mess (e.g., a large number of non-ASCII characters displayed all wrong). In either case, I think an error should be signalled to the user, together with a) automatically trying another encoding, such as the locale default encoding instead of UTF-8 or UTF-8 instead of anything else, b) suggesting to the user that he should try to view the page using some other encoding, possibly with a menu of encodings offered as part of the error explanation, or c) a combination of the above. 
This presumes that a user knows the correct encoding. But do European people really know the correct encoding of ISO-8859-* pages? I, as a Japanese speaker, imagine that it is hard to distinguish an ISO-8859-1 page from an ISO-8859-2 page. Although there are good reasons why browsers usually don't give error messages, this would be a special case. It's about the primary interpretation of the data in the document and about a situation where some data has no interpretation in the assumed encoding - but usually has an interpretation in some other encoding. Some browsers alert about scripting issues. Why can't they alert about an encoding issue? The current Character encoding overrides rules are questionable because they often mask out data errors that would have helped to detect problems that can be solved constructively. For example, if data labeled as ISO-8859-1 contains an octet in the 80...9F range, then it may well be the case that the data is actually
Re: [whatwg] Default encoding to UTF-8?
2011-12-06 15:59, NARUSE, Yui wrote: (2011/12/06 17:39), Jukka K. Korpela wrote: 2011-12-06 6:54, Leif Halvard Silli wrote: Yeah, it would be a pity if it had already become a widespread cargo cult to - all at once - use the HTML5 doctype without using UTF-8 *and* without using some encoding declaration *and* thus effectively relying on the default locale encoding ... Who does have a data corpus? I found it: http://rink77.web.fc2.com/html/metatagu.html I'm not sure of the intended purpose of that demo page, but it seems to illustrate my point. It uses the HTML5 doctype, does not declare an encoding, and its encoding is Shift_JIS, the default encoding of the Japanese locale. My Firefox uses the ISO-8859-1 encoding, my IE the windows-1252 encoding, resulting in a mess of course. But the point is that both interpretations mean data errors at the character level - even seen as windows-1252, it contains bytes with no assigned meaning (e.g., 0x81 is UNDEFINED). Current implementations replace such an invalid octet with a replacement character. No, it varies by implementation. When data that is actually in ISO-8859-1 or some similar encoding has been mislabeled as UTF-8 encoded, then, if the data contains octets outside ASCII, character-level errors are likely to occur. Many ISO-8859-1 octets are just not possible in UTF-8 data. The converse error may also cause character-level errors. And these are not uncommon situations - they seem to occur increasingly often, partly due to cargo-cult use of UTF-8 (when it means declaring UTF-8 but not actually using it, or vice versa), partly due to increased use of UTF-8 combined with ISO-8859-1 encoded data creeping in from somewhere into UTF-8 encoded data. In such a case, the page should fail to display in the author's environment. An authoring tool should surely indicate the problem. But what should user agents do when they face such documents and need to do something with them? 
From the user's point of view, the character-level errors currently result in some gibberish (e.g., some odd box appearing instead of a character, in one place) or in a total mess (e.g., a large number of non-ASCII characters displayed all wrong). In either case, I think an error should be signalled to the user, together with a) automatically trying another encoding, such as the locale default encoding instead of UTF-8 or UTF-8 instead of anything else, b) suggesting to the user that he should try to view the page using some other encoding, possibly with a menu of encodings offered as part of the error explanation, or c) a combination of the above. This presumes that a user knows the correct encoding. Alternative b) means that the user can try some encodings. A user agent could give a reasonable list of options. Consider the example document mentioned. When viewed in a Western environment, it probably looks all gibberish. Alternative a) would probably not help, but alternative b) would have some chances. If the user has some reason to suspect that the page might be in Japanese, he would probably try the Japanese encodings in the browser's list of encodings, and this would make the document readable after a try or two. I, as a Japanese speaker, imagine that it is hard to distinguish an ISO-8859-1 page from an ISO-8859-2 page. Yes, but the idea isn't really meant to apply to such cases, as there is no way _at the character encoding level_ to recognize ISO-8859-1 mislabeled as ISO-8859-2 or vice versa. Some browsers alert about scripting issues. Why can't they alert about an encoding issue? Surely they could, though I was not thinking of an alert in a popup sense - rather, a red error indicator somewhere. There would be many more reasons to signal encoding issues than to signal scripting issues, as we know that web pages generally contain loads of client-side scripting errors that do not actually affect page rendering or functionality. 
The current Character encoding overrides rules are questionable because they often mask out data errors that would have helped to detect problems that can be solved constructively. For example, if data labeled as ISO-8859-1 contains an octet in the 80...9F range, then it may well be the case that the data is actually windows-1252 encoded and the override helps everyone. But it may also be the case that the data is in a different encoding and that the override therefore results in gibberish shown to the user, with no hint of the cause of the problem. I think such a case doesn't exist. In the character encoding overrides, a superset overrides a standard set. Technically, not quite so (e.g., in ISO-8859-1, 0x81 is U+0081, a control character that is not allowed in HTML - I suppose, though I cannot really find a statement on this in HTML5 - whereas in windows-1252, it is undefined). More importantly, my point was about errors in data, resulting e.g. from a faulty code conversion or some malfunctioning software that has produced,
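The 0x81 distinction discussed above can be checked directly with Python's codecs (an illustration, assuming Python's tables match the encoding registry): ISO-8859-1 gives every byte a meaning, while windows-1252 leaves a handful of 0x80-0x9F positions undefined.

```python
# In ISO-8859-1 (latin-1), 0x81 is the C1 control character U+0081 ...
assert b"\x81".decode("latin-1") == "\u0081"

# ... but in windows-1252 it is UNDEFINED, so a strict decode fails:
try:
    b"\x81".decode("cp1252")
    raise AssertionError("unreachable: 0x81 is not defined in cp1252")
except UnicodeDecodeError:
    pass

# Most other 0x80-0x9F bytes *gain* printable meanings in windows-1252:
assert b"\x93".decode("cp1252") == "\u201c"   # LEFT DOUBLE QUOTATION MARK
assert b"\x93".decode("latin-1") == "\u0093"  # just a control character
```

So the superset-overrides-subset rule is almost, but not quite, lossless: a few byte values that are (useless but defined) controls in ISO-8859-1 have no interpretation at all under the windows-1252 override.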
Re: [whatwg] Default encoding to UTF-8?
2011-12-06 22:58, Leif Halvard Silli writes: There is now a bug, and the editor says the outcome depends on a browser vendor to ship it: https://www.w3.org/Bugs/Public/show_bug.cgi?id=15076 Jukka K. Korpela Tue Dec 6 00:39:45 PST 2011: what is this proposed change to defaults supposed to achieve? […] I'd say the same as in XML: UTF-8 as a reliable, common default. The argument given when the bug was created was: It would be nice to minimize the number of declarations a page needs to include. That is, author convenience - so that authors could work sloppily and produce documents that could fail on user agents that haven't implemented this change. This sounds more absurd than I can describe. XML was created as a new data format; it was an entirely different issue. If there's something that should be added to or modified in the algorithm for determining character encoding, then I'd say it's error processing. I mean user agent behavior when it detects, [...] There is already an (optional) detection step in the algorithm - but UAs treat that step differently, it seems. I'm afraid I can't find it - I mean the treatment of a document for which some encoding has been deduced (say, directly from HTTP headers) and which then turns out to violate the rules of the encoding. Yucca
Re: [whatwg] Default encoding to UTF-8?
On Mon, Dec 5, 2011 at 8:55 PM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: When you say 'requires': Of course, HTML5 recommends that you declare the encoding (via HTTP/higher protocol, via the BOM 'sideshow' or via meta charset=UTF-8). I just now also discovered that Validator.nu issues an error message if it does not find any of those *and* the document contains non-ASCII. (I don't know, however, whether this error message is just something Henri added at his own discretion - it would be nice to have it literally in the spec too.) I believe I was implementing exactly what the spec said at the time I implemented that behavior of Validator.nu. I'm particularly convinced that I was following the spec, because I think it's not the optimal behavior. I think pages that don't declare their encoding should always be non-conforming even if they only contain ASCII bytes, because that way templates created by English-oriented (or lorem ipsum -oriented) authors would be caught as non-conforming before non-ASCII text gets filled into them later. Hixie disagreed. HTML5 says that validators *may* issue a warning if UTF-8 is *not* the encoding. But so far, validator.nu has not picked that up. Maybe it should. However, non-UTF-8 pages that label their encoding, that use one of the encodings that we won't be able to get rid of anyway and that don't contain forms aren't actively harmful. (I'd argue that they are *less* harmful than unlabeled UTF-8 pages.) Non-UTF-8 is harmful in form submission. It would be more focused to make the validator complain about labeled non-UTF-8 if the page contains a form. Also, it could be useful to make Firefox whine to the console when a form is submitted in non-UTF-8 and when an HTML page has no encoding label. (I'd much rather implement all these than implement breaking changes to how Firefox processes legacy content.) 
We should also lobby for authoring tools (as recommended by HTML5) to default their output to UTF-8 and make sure the encoding is declared. HTML5 already says: Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629] http://dev.w3.org/html5/spec/semantics.html#charset I think focusing your efforts on lobbying authoring tool vendors to withhold the ability to save pages in non-UTF-8 encodings would be a better way to promote UTF-8 than lobbying browser vendors to change the defaults in ways that'd break locale-siloed Existing Content. -- Henri Sivonen hsivo...@iki.fi http://hsivonen.iki.fi/
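Henri's point that non-UTF-8 is harmful specifically in form submission can be illustrated with Python's standard library: the same form field value is percent-encoded as different bytes depending on the page's encoding, so a server receiving an unlabeled query string has to guess. The query strings below are purely illustrative:

```python
from urllib.parse import urlencode

# The page's encoding decides how non-ASCII form data is percent-encoded.
utf8_qs = urlencode({"q": "caf\u00e9"})                           # UTF-8 (the default)
legacy_qs = urlencode({"q": "caf\u00e9"}, encoding="windows-1252")

print(utf8_qs)    # q=caf%C3%A9
print(legacy_qs)  # q=caf%E9
```

This is why a validator warning about labeled non-UTF-8 only on pages containing a form, as suggested above, would target the actually harmful case.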
Re: [whatwg] Default encoding to UTF-8?
L. David Baron on Wed Nov 30 18:29:31 PST 2011: On Wednesday 2011-11-30 15:28 -0800, Faruk Ates wrote: My understanding is that all browsers* default to Western Latin (ISO-8859-1) encoding by default (for Western-world downloads/OSes) due to legacy content on the web. But how relevant is that still today? Has any browser done any recent research into the need for this? The default varies by localization (and within that potentially by platform), and unfortunately that variation does matter. You can see Firefox's defaults here: http://mxr.mozilla.org/l10n-mozilla-beta/search?string=intl.charset.default (The localization and platform are part of the filename.) Last I checked, some of those locales defaulted to UTF-8. (And HTML5 defines it the same.) So how is that possible? Don't users of those locales travel as much as you do? Or do we consider the English locale users as more important? Something is broken in the logic here! I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago (by changing the intl.charset.default preference), and I do see a decent amount of broken content as a result (maybe I encounter a new broken page once a week? -- though substantially more often if I'm looking at non-English pages because of travel). What kind of trouble are you actually describing here? You are describing a problem with using UTF-8 for *your locale*. What is your locale? It is probably English. Or do you consider your locale to be 'the Western world locale'? It sounds like *that* is what Anne has in mind when he brings in Dutch: http://blog.whatwg.org/weekly-encoding-woes (Quite often it sounds as if some see Latin-1 - or Windows-1252, as we now should say - as a 'super default' rather than a locale default. If that is the case, that it is a super default, then we should also spec it like that! Until further notice, I'll treat Latin-1 as it is specced: as a default for certain locales.) 
Since it is a locale problem, we need to understand which locale you have - and/or which locale you - and other debaters - think you have. Faruk probably uses a Spanish locale - right? - so the two of you are not speaking out of the same context. However, you also say that your problem is not so much related to pages written for *your* locale as it is related to pages written for users of *other* locales. So how many times per year do Dutch, Spanish, Norwegian - and other non-English pages - create trouble for you, as an English locale user? I am making an assumption: almost never. You don't read those languages, do you? This is also an expectation thing: if you visit a Russian page in a legacy Cyrillic encoding, and get mojibake because your browser defaults to Latin-1, then what does it matter to you whether your browser defaults to Latin-1 or UTF-8? Answer: nothing. I'm wondering if it might not be good to start encouraging defaulting to UTF-8, and only fall back to Western Latin if it is detected that the content is very old / served by old infrastructure or servers, etc. And of course if the content is served with an explicit encoding of Western Latin. The more complex the rules, the harder they are for authors to understand / debug. I wouldn't want to create rules like those. Agree that that particular idea is probably not the best. I would, however, like to see movement towards defaulting to UTF-8: the current situation makes the Web less world-wide because pages that work for one user don't work for another. I'm just not quite sure how to get from here to there, though, since such changes are likely to make users experience broken content. I think we should 'attack' the dominating locale first: the English locale, in its different incarnations (Australian, American, UK). Thus, we should turn things on their head: English users should start to expect UTF-8 to be used. 
Because, as English users, you are more used to 'mojibake' than the rest of us are: whenever you see it, you 'know' that it is because it is a foreign language you are reading. It is we, the users of non-English locales, who need the default-to-legacy-encoding behavior the most. Or, please, explain to us when and where it is important that English language users, living in their own, native lands so to speak, need their browser to default to Latin-1 so that they can correctly read English language pages. If the English locales start defaulting to UTF-8, then little by little the same expectation etc. will start spreading to the other locales as well, not least because the 'geeks' of each locale will tend to see the English locale as a super default - and they might also use the US English locale of their OS and/or browser. We should not consider the needs of geeks - they will follow (read: lead) the way, so the fact that *they* may see mojibake should not be a concern. See? We would have a plan. Or what do you think? Of course, we - or rather: the
Re: [whatwg] Default encoding to UTF-8?
(And HTML5 defines it the same.) No. As far as I understand, HTML5 defines US-ASCII to be the default and requires that any other encoding is explicitly declared. I do like this approach. We should also lobby for authoring tools (as recommended by HTML5) to default their output to UTF-8 and make sure the encoding is declared. As so many pages, supposedly (I have not researched this), use the incorrect encoding, it makes no sense to try to clean up this mess by messing with existing defaults. It may fix some pages and break others. Browsers have the ability to override an incorrect encoding, and this is a reasonable workaround. -- Sergiusz On Mon, Dec 5, 2011 at 6:42 PM, Leif Halvard Silli xn--mlform-...@xn--mlform-iua.no wrote: L. David Baron on Wed Nov 30 18:29:31 PST 2011: On Wednesday 2011-11-30 15:28 -0800, Faruk Ates wrote: My understanding is that all browsers* default to Western Latin (ISO-8859-1) encoding by default (for Western-world downloads/OSes) due to legacy content on the web. But how relevant is that still today? Has any browser done any recent research into the need for this? The default varies by localization (and within that potentially by platform), and unfortunately that variation does matter. You can see Firefox's defaults here: http://mxr.mozilla.org/l10n-mozilla-beta/search?string=intl.charset.default (The localization and platform are part of the filename.) Last I checked, some of those locales defaulted to UTF-8. (And HTML5 defines it the same.) So how is that possible? Don't users of those locales travel as much as you do? Or do we consider the English locale users as more important? Something is broken in the logic here! I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago (by changing the intl.charset.default preference), and I do see a decent amount of broken content as a result (maybe I encounter a new broken page once a week? -- though substantially more often if I'm looking at non-English pages because of travel). 
What kind of trouble are you actually describing here? You are describing a problem with using UTF-8 for *your locale*. What is your locale? It is probably English. Or do you consider your locale to be 'the Western world locale'? It sounds like *that* is what Anne has in mind when he brings in Dutch: http://blog.whatwg.org/weekly-encoding-woes (Quite often it sounds as if some see Latin-1 - or Windows-1252, as we now should say - as a 'super default' rather than a locale default. If that is the case, that it is a super default, then we should also spec it like that! Until further notice, I'll treat Latin-1 as it is specced: as a default for certain locales.) Since it is a locale problem, we need to understand which locale you have - and/or which locale you - and other debaters - think you have. Faruk probably uses a Spanish locale - right? - so the two of you are not speaking out of the same context. However, you also say that your problem is not so much related to pages written for *your* locale as it is related to pages written for users of *other* locales. So how many times per year do Dutch, Spanish, Norwegian - and other non-English pages - create trouble for you, as an English locale user? I am making an assumption: almost never. You don't read those languages, do you? This is also an expectation thing: if you visit a Russian page in a legacy Cyrillic encoding, and get mojibake because your browser defaults to Latin-1, then what does it matter to you whether your browser defaults to Latin-1 or UTF-8? Answer: nothing. I'm wondering if it might not be good to start encouraging defaulting to UTF-8, and only fall back to Western Latin if it is detected that the content is very old / served by old infrastructure or servers, etc. And of course if the content is served with an explicit encoding of Western Latin. The more complex the rules, the harder they are for authors to understand / debug. I wouldn't want to create rules like those. 
Agree that that particular idea is probably not the best. I would, however, like to see movement towards defaulting to UTF-8: the current situation makes the Web less world-wide because pages that work for one user don't work for another. I'm just not quite sure how to get from here to there, though, since such changes are likely to make users experience broken content. I think we should 'attack' the dominating locale first: The English locale, in its different incarnations (Australian, American, UK). Thus, we should turn things on the head: English users should start to expect UTF-8 to be used. Because, as English users, you are more used to 'mojibake' than the rest of us are: Whenever you see it, you 'know' that it is because it is a foreign language you are reading. It is we, the users of non-English locales, that need the default-to-legacy encoding behavior the most. Or, please, explain to us
Re: [whatwg] Default encoding to UTF-8?
(And HTML5 defines it the same.) No. As far as I understand, HTML5 defines US-ASCII to be the default and requires that any other encoding is explicitly declared. I do like this approach. We are here discussing the default *user agent behaviour* - we are not specifically discussing how web pages should be authored. For user agents, please be aware that HTML5 maintains a table of 'Suggested default encoding' values: http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding When you say 'requires': of course, HTML5 recommends that you declare the encoding (via HTTP/a higher protocol, via the BOM 'sideshow', or via <meta charset=UTF-8>). I just now also discovered that Validator.nu issues an error message if it does not find any of those *and* the document contains non-ASCII. (I don't know, however, whether this error message is just something Henri added at his own discretion - it would be nice to have it literally in the spec too.) (The problem is of course that many English pages expect the whole Unicode alphabet even if they only contain US-ASCII from the start.) HTML5 says that validators *may* issue a warning if UTF-8 is *not* the encoding. But so far, validator.nu has not picked that up. We should also lobby for authoring tools (as recommended by HTML5) to default their output to UTF-8 and make sure the encoding is declared. HTML5 already says: 'Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]' http://dev.w3.org/html5/spec/semantics.html#charset As so many pages, supposedly (I have not researched this), use the incorrect encoding, it makes no sense to try to clean up this mess by messing with existing defaults. It may fix some pages and break others. Browsers have the ability to override an incorrect encoding, and this is a reasonable workaround. Do you use an English-locale computer? If you do, without being a native English speaker, then you are some kind of geek ... 
Why can't you work around the troubles - as you are used to doing anyway? Starting a switch to UTF-8 as the default UA encoding for English-locale users should *only* affect how English-locale users experience languages which *both* need non-ASCII *and* historically have been using Windows-1252 as the default encoding *and* which additionally do not include any encoding declaration. -- Leif Halvard Silli
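The Validator.nu behaviour described above - an error only when the document contains non-ASCII *and* no encoding was declared anywhere - is simple to model. The following is an illustrative sketch, not Validator.nu's actual code; the function name and signature are invented for the example:

```python
def missing_encoding_declaration(doc: bytes, has_declaration: bool) -> bool:
    # Flag a document only if it contains non-ASCII bytes AND no encoding
    # was declared via HTTP, the BOM, or <meta charset=...>.
    return (not has_declaration) and any(byte > 0x7F for byte in doc)

# Pure-ASCII documents pass even without a declaration;
# documents with non-ASCII content need one.
assert missing_encoding_declaration(b"plain ASCII text", False) is False
assert missing_encoding_declaration("caf\u00e9".encode("utf-8"), False) is True
assert missing_encoding_declaration("caf\u00e9".encode("utf-8"), True) is False
```

This also shows why the check is conservative: an undeclared pure-ASCII page decodes identically under every ASCII-compatible fallback, so only pages with non-ASCII bytes are at risk.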
Re: [whatwg] Default encoding to UTF-8?
On 12/5/11 12:42 PM, Leif Halvard Silli wrote: Last I checked, some of those locales defaulted to UTF-8. (And HTML5 defines it the same.) So how is that possible? Because authors authoring pages that users of those locales tend to use use UTF-8 more than anything else? Don't users of those locales travel as much as you do? People on average travel less than David does, yes. In all locales. But that's not the point. I think you completely misunderstood his comments about travel and locales. Keep reading. What kind of trouble are you actually describing here? You are describing a problem with using UTF-8 for *your locale*. No. He's describing a problem using UTF-8 to view pages that are not written in English. Now what language are the non-English pages you look at written in? Well, it depends. In western Europe they tend to be in languages that can be encoded in ISO-8859-1, so authors sometimes use that encoding (without even realizing it). If you set your browser to default to UTF-8, those pages will be broken. In Japan, a number of pages are authored in Shift_JIS. Those will similarly be broken in a browser defaulting to UTF-8. What is your locale? Why does it matter? David's default locale is almost certainly en-US, which defaults to ISO-8859-1 (or whatever Windows-??? encoding that actually means on the web) in his browser. But again, he's changed the default encoding from the locale default, so the locale is irrelevant. (Quite often it sounds as if some see Latin-1 - or Windows-1252, as we now should say - as a 'super default' rather than a locale default. If that is the case, that it is a super default, then we should also spec it like that! Until further notice, I'll treat Latin-1 as it is specced: as a default for certain locales.) That's exactly what it is. Since it is a locale problem, we need to understand which locale you have - and/or which locale you - and other debaters - think you have. Again, doesn't matter if you change your settings from the default. 
However, you also say that your problem is not so much related to pages written for *your* locale as it is to pages written for users of *other* locales. So how many times per year do Dutch, Spanish or Norwegian - and other non-English - pages create trouble for you, as an English-locale user? I am making an assumption: almost never. You don't read those languages, do you? Did you miss the travel part? Want to look up web pages for museums, airports, etc. in a non-English-speaking country? There's a good chance they're not in English! This is also an expectation thing: if you visit a Russian page in a legacy Cyrillic encoding and get mojibake because your browser defaults to Latin-1, then what does it matter to you whether your browser defaults to Latin-1 or UTF-8? Answer: nothing. Yes. So? I think we should 'attack' the dominating locale first: the English locale, in its different incarnations (Australian, American, UK). Thus, we should turn things on their head: English users should start to expect UTF-8 to be used. Because, as English users, you are more used to 'mojibake' than the rest of us are: whenever you see it, you 'know' that it is because it is a foreign language you are reading. Modulo smart quotes (and recently Unicode ellipsis characters). These are actually pretty common in English text on the web nowadays, and have a tendency to be in ISO-8859-1. Or, please, explain to us when and where it is important that English-language users living in their own, native lands, so to speak, need their browser to default to Latin-1 so that they can correctly read English-language pages? See above. See? We would have a plan. Or what do you think? Try it in your browser. When I set UTF-8 as my default, there were broken quotation marks all over the web for me. And I'm talking pages in English. -Boris
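The "broken quotation marks" Boris reports are easy to reproduce: the Windows-1252 smart-quote bytes 0x93 and 0x94 are bare continuation bytes in UTF-8, so a browser defaulting to UTF-8 renders each as a replacement character. A minimal sketch of the effect (Python is used here purely for illustration):

```python
# Smart quotes as a Windows-1252 page would serve them.
page_bytes = "\u201csmart quotes\u201d".encode("windows-1252")
print(page_bytes)  # b'\x93smart quotes\x94'

# The same bytes as seen by a browser defaulting to UTF-8:
# 0x93 and 0x94 are not valid UTF-8 lead bytes, so each becomes U+FFFD.
print(page_bytes.decode("utf-8", errors="replace"))  # \ufffdsmart quotes\ufffd
```

This is the one class of page where an English-locale UTF-8 default visibly breaks English text, which is why it keeps coming up in the thread.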
Re: [whatwg] Default encoding to UTF-8?
Boris Zbarsky Mon Dec 5 13:49:45 PST 2011: On 12/5/11 12:42 PM, Leif Halvard Silli wrote: Last I checked, some of those locales defaulted to UTF-8. (And HTML5 defines it the same.) So how is that possible? Because authors authoring pages that users of those locales tend to use use UTF-8 more than anything else? It is more likely that there is another reason, IMHO: they may have tried it and found that it worked OK. But they of course have the same need for reading non-English museum and railway pages as Mozilla employees. Don't users of those locales travel as much as you do? I think you completely misunderstood his comments about travel and locales. Keep reading. I'm pretty sure I haven't misunderstood very much. What kind of trouble are you actually describing here? You are describing a problem with using UTF-8 for *your locale*. No. He's describing a problem using UTF-8 to view pages that are not written in English. And why is that a problem, in those cases when it is a problem? Does he read those languages, anyway? Don't we expect some problems when we tread outside our borders? Now what language are the non-English pages you look at written in? Well, it depends. In western Europe they tend to be in languages that can be encoded in ISO-8859-1, so authors sometimes use that encoding (without even realizing it). If you set your browser to default to UTF-8, those pages will be broken. In Japan, a number of pages are authored in Shift_JIS. Those will similarly be broken in a browser defaulting to UTF-8. The solution I proposed was that English-locale browsers should default to UTF-8. Of course, such users, when in Japan, could get problems - on some Japanese pages - which is a small nuisance, especially if they read Japanese. What is your locale? Why does it matter? David's default locale is almost certainly en-US, which defaults to ISO-8859-1 (or whatever Windows-??? encoding that actually means on the web) in his browser. 
But again, he's changed the default encoding from the locale default, so the locale is irrelevant. The locale is meant predominantly to be used within a physical locale. If he is at another physical locale, or a virtually other locale, he should not expect that it works out of the box unless a common encoding is used. Even today, if he visits Japan, he has to either change his browser settings *or* rely on the pages declaring their encodings. So nothing would change, for him, when visiting Japan — with his browser or with his computer. Yes, there would be a change, w.r.t. English quotation marks (see below) and w.r.t. visiting Western European language pages: for those, a number of pages which don't fail with Win-1252 as the default would start to fail. But relatively speaking, it is less important that non-English pages fail for the English locale. (Quite often it sounds as if some see Latin-1 - or Windows-1252, as we now should say - as a 'super default' rather than a locale default. If that is the case, that it is a super default, then we should also spec it like that! Until further notice, I'll treat Latin-1 as it is specced: as a default for certain locales.) That's exactly what it is. A default for certain locales? Right. Since it is a locale problem, we need to understand which locale you have - and/or which locale you - and other debaters - think you have. Again, doesn't matter if you change your settings from the default. I don't think I have misunderstood anything. However, you also say that your problem is not so much related to pages written for *your* locale as it is to pages written for users of *other* locales. So how many times per year do Dutch, Spanish or Norwegian - and other non-English - pages create trouble for you, as an English-locale user? I am making an assumption: almost never. You don't read those languages, do you? Did you miss the travel part? 
Want to look up web pages for museums, airports, etc. in a non-English-speaking country? There's a good chance they're not in English! There is a very good chance, also, that only very few of the web pages of such professional institutions would fail to declare their encoding. This is also an expectation thing: if you visit a Russian page in a legacy Cyrillic encoding and get mojibake because your browser defaults to Latin-1, then what does it matter to you whether your browser defaults to Latin-1 or UTF-8? Answer: nothing. Yes. So? So we can look away from Greek, Cyrillic, Japanese, Chinese etc. in this debate. The only eventual benefit for English-locale users of keeping Win-1252 as the default is that they can have a tiny number of fewer problems when visiting Western European language web pages with their computer. (Yes, I saw that you mention smart quotes etc. below - so there is that reason too.) I think we should 'attack' the dominating
Re: [whatwg] Default encoding to UTF-8?
On Fri, 02 Dec 2011 15:50:31 -, Henri Sivonen hsivo...@iki.fi wrote: That compatibility mode already exists: It's the default mode--just like the quirks mode is the default for pages that don't have a doctype. You opt out of the quirks mode by saying <!DOCTYPE html>. You opt out of the encoding compatibility mode by saying <meta charset=utf-8>. Could <!DOCTYPE html> be an opt-in to a default UTF-8 encoding? It would be nice to minimize the number of declarations a page needs to include. -- regards, Kornel Lesiński
Re: [whatwg] Default encoding to UTF-8?
On Dec 5, 2011, at 4:10 PM, Kornel Lesiński wrote: Could <!DOCTYPE html> be an opt-in to a default UTF-8 encoding? It would be nice to minimize the number of declarations a page needs to include. I like that idea. Maybe it’s not too late. -- Darin
Re: [whatwg] Default encoding to UTF-8?
On 12/5/11 6:14 PM, Leif Halvard Silli wrote: It is more likely that there is another reason, IMHO: They may have tried it, and found that it worked OK. Where by 'it' you mean open a text editor, type some text, and save. So they get whatever encoding their OS and editor defaults to. And yes, then they find that it works ok, so they don't worry about encodings. No. He's describing a problem using UTF-8 to view pages that are not written in English. And why is that a problem in those cases when it is a problem? Because the characters are wrong? Does he read those languages, anyway? Do you read English? Seriously, what are you asking there, exactly? (For the record, reading a particular page in a language is a much simpler task than reading the language; I can't read German, but I can certainly read a German subway map.) The solution I proposed was that English-locale browsers should default to UTF-8. I know the solution you proposed. That solution tries to avoid the issues David was describing by only breaking things for people in English browser locales; I understand that. Why does it matter? David's default locale is almost certainly en-US, which defaults to ISO-8859-1 (or whatever Windows-??? encoding that actually means on the web) in his browser. But again, he's changed the default encoding from the locale default, so the locale is irrelevant. The locale is meant predominantly to be used within a physical locale. Yes, so? If he is at another physical locale, or a virtually other locale, he should not expect that it works out of the box unless a common encoding is used. He was responding to a suggestion that the default encoding be changed to UTF-8 for all locales. Are you _really_ sure you understood the point of his mail? Even today, if he visits Japan, he has to either change his browser settings *or* rely on the pages declaring their encodings. So nothing would change, for him, when visiting Japan — with his browser or with his computer. 
He wasn't saying it's a problem for him per se. He's a somewhat sophisticated browser user who knows how to change the encoding for a particular page. What he was saying is that there are lots of pages out there that aren't encoded in UTF-8 and rely on locale fallbacks to particular encodings, and that he's run into them a bunch while traveling in particular, so they were not pages in English. So far, you and he seem to agree. Yes, there would be a change, w.r.t. English quotation marks (see below) and w.r.t. visiting Western European language pages: for those, a number of pages which don't fail with Win-1252 as the default would start to fail. But relatively speaking, it is less important that non-English pages fail for the English locale. No one is worried about that, particularly. There is a very good chance, also, that only very few of the web pages of such professional institutions would fail to declare their encoding. You'd be surprised. Modulo smart quotes (and recently Unicode ellipsis characters). These are actually pretty common in English text on the web nowadays, and have a tendency to be in ISO-8859-1. If we change the default, they will start to tend to be in UTF-8. Not unless we change the authoring tools. Half the time these things are just directly exported from a word processor. OK: Quotation marks. However, in 'old web pages' you also find much more use of HTML entities (such as &ldquo;) than you find today. We should take advantage of that, no? I have no idea what you're trying to say. When you mention quotation marks, you mention a real locale-related issue. And maybe the Euro sign too? Not an issue for me personally, but it could be for some, yes. Nevertheless, the problem is smallest for languages that primarily limit their alphabet to those letters that are present in the American Standard Code for Information Interchange format. Sure. It may still be too big. 
It would be logical, thus, to start the switch to UTF-8 for those locales. If we start at all. Perhaps we need to have a project to measure these problems, instead of all these anecdotes? Sure. More data is always better than anecdotes. -Boris
Re: [whatwg] Default encoding to UTF-8?
On 12/5/11 6:14 PM, Leif Halvard Silli wrote: It is more likely that there is another reason, IMHO: They may have tried it, and found that it worked OK. Where by 'it' you mean open a text editor, type some text, and save. So they get whatever encoding their OS and editor defaults to. If that is all they tested, then I'd say they did not test enough. And yes, then they find that it works ok, so they don't worry about encodings. Ditto. No. He's describing a problem using UTF-8 to view pages that are not written in English. And why is that a problem in those cases when it is a problem? Because the characters are wrong? But the characters will be wrong many more times than exactly those times when he tries to read a web page in a Western European language that is not declared as Win-1252. Do English-locale users have particular expectations with regard to exactly those web pages? What about Polish web pages, etc.? English-locale users are a very multiethnic lot. Does he read those languages, anyway? Do you read English? Seriously, what are you asking there, exactly? Because if it is an issue, then it is about expectations for exactly those pages. (Plus the quote problem, of course.) (For the record, reading a particular page in a language is a much simpler task than reading the language; I can't read German, but I can certainly read a German subway map.) Or a Polish subway map - which doesn't default to said encoding. The solution I proposed was that English-locale browsers should default to UTF-8. I know the solution you proposed. That solution tries to avoid the issues David was describing by only breaking things for people in English browser locales; I understand that. That characterization is only true with regard to the quote problem. That German pages break would be no more important than the fact that Polish pages would. For that matter: it happens that UTF-8 pages break as well. I only suggest it as a first step, so to speak. 
Or rather - since some locales apparently already default to UTF-8 - as a next step. Thereafter, more locales would be expected to follow suit - as the development of each locale permits. Why does it matter? David's default locale is almost certainly en-US, which defaults to ISO-8859-1 (or whatever Windows-??? encoding that actually means on the web) in his browser. But again, he's changed the default encoding from the locale default, so the locale is irrelevant. The locale is meant predominantly to be used within a physical locale. Yes, so? So then we have a set of expectations for the language of that locale. If we look at how the locale settings handle other languages, then we are outside the issue that the locale-specific encodings are supposed to handle. If he is at another physical locale, or a virtually other locale, he should not expect that it works out of the box unless a common encoding is used. He was responding to a suggestion that the default encoding be changed to UTF-8 for all locales. Are you _really_ sure you understood the point of his mail? I said I agreed with him that Faruk's solution was not good. However, I would not be against treating <!DOCTYPE html> as a 'default to UTF-8' declaration, as suggested by some - if it were possible to agree about that. Then we could keep things as they are, except for the HTML5 DOCTYPE. I guess the HTML5 doctype would become 'the default before the default': if everything else fails, then UTF-8 if the DOCTYPE is <!DOCTYPE html>, or else, the locale default. It sounded like Darin Adler thinks it possible. How about you? Even today, if he visits Japan, he has to either change his browser settings *or* rely on the pages declaring their encodings. So nothing would change, for him, when visiting Japan — with his browser or with his computer. He wasn't saying it's a problem for him per se. He's a somewhat sophisticated browser user who knows how to change the encoding for a particular page. 
If we are talking about an English-locale user visiting Japan, then I doubt a change in the default encoding would matter - Win-1252 as default would be wrong anyway. What he was saying is that there are lots of pages out there that aren't encoded in UTF-8 and rely on locale fallbacks to particular encodings, and that he's run into them a bunch while traveling in particular, so they were not pages in English. So far, you and he seem to agree. So far we agree, yes. Yes, there would be a change, w.r.t. English quotation marks (see below) and w.r.t. visiting Western European language pages: for those, a number of pages which don't fail with Win-1252 as the default would start to fail. But relatively speaking, it is less important that non-English pages fail for the English locale. No one is worried about that, particularly. You spoke about visiting German pages above - sounded like you worried, but
Re: [whatwg] Default encoding to UTF-8?
On 12/5/11 9:55 PM, Leif Halvard Silli wrote: If that is all they tested, then I'd say they did not test enough. That's normal for the web. (For the record, reading a particular page in a language is a much simpler task than reading the language; I can't read German, but I can certainly read a German subway map.) Or a Polish subway map - which doesn't default to said encoding. Indeed. I don't think anyone thinks the existing situation is all fine or anything. I said I agreed with him that Faruk's solution was not good. However, I would not be against treating <!DOCTYPE html> as a 'default to UTF-8' declaration This might work, if there hasn't been too much cargo-culting yet. Data urgently needed! Not unless we change the authoring tools. Half the time these things are just directly exported from a word processor. Please educate me. I'm perhaps 'handicapped' in that regard: I haven't used MS Word on a regular basis since MS Word 5.1 for Mac. Also, if export means copy and paste It can mean that, or save as HTML followed by copy and paste. then on the Mac, everything gets converted via the clipboard On Mac, the default OS encoding is UTF-8 last I checked. That's decidedly not the case on Windows. OK: Quotation marks. However, in 'old web pages' you also find much more use of HTML entities (such as &ldquo;) than you find today. We should take advantage of that, no? I have no idea what you're trying to say, Sorry. What I meant was that character entities are encoding-independent. Yes. And that lots of people - and authoring tools - have inserted non-ASCII letters and characters as character entities, Sure. And lots have inserted them directly. At any rate: a page which uses character entities for non-ASCII would render the same regardless of encoding, hence a switch to UTF-8 would not matter for those. Sure. We're not worried about such pages here. -Boris
Re: [whatwg] Default encoding to UTF-8?
Boris Zbarsky Mon Dec 5 19:18:10 PST 2011: On 12/5/11 9:55 PM, Leif Halvard Silli wrote: I said I agreed with him that Faruk's solution was not good. However, I would not be against treating <!DOCTYPE html> as a 'default to UTF-8' declaration This might work, if there hasn't been too much cargo-culting yet. Data urgently needed! Yeah, it would be a pity if it had already become a widespread cargo cult to - all at once - use the HTML5 doctype without using UTF-8 *and* without using some encoding declaration *and* thus effectively relying on the default locale encoding ... Who does have a data corpus? Henri, as the Validator.nu developer? This change would involve adding one more step to the HTML5 parser's encoding sniffing algorithm. [1] The question then is when, upon seeing the HTML5 doctype, the default to UTF-8 ought to happen in order to be useful. It seems it would have to happen after the processing of the explicit metadata (steps 1 to 5) but before the last 3 steps - steps 6, 7 and 8: Step 6: 'if the user agent has information on the likely encoding' Step 7: UA 'may attempt to autodetect the character encoding' Step 8: 'implementation-defined or user-specified default' The role of the HTML5 DOCTYPE, encoding-wise, would then be to ensure that steps 6 to 8 do not happen. [1] http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm -- Leif H Silli
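The ordering Leif proposes can be sketched as follows. This is a rough model, not the spec's actual algorithm: find_meta_charset stands in for steps 1 to 5, steps 6 to 8 are collapsed into a single locale fallback, and all function names are invented for the example:

```python
import re

def find_meta_charset(head: bytes):
    # Stand-in for the spec's explicit-metadata steps (1-5): look for a
    # declared charset in the first bytes of the document.
    m = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_-]+)', head)
    return m.group(1).decode("ascii").lower() if m else None

def sniff_encoding(head: bytes, locale_default: str = "windows-1252") -> str:
    declared = find_meta_charset(head)
    if declared:                        # an explicit declaration always wins
        return declared
    if b"<!doctype html" in head.lower():
        return "utf-8"                  # the proposed new step
    return locale_default               # steps 6-8, collapsed here

assert sniff_encoding(b"<!DOCTYPE html><p>hi") == "utf-8"
assert sniff_encoding(b"<meta charset=ISO-8859-2>") == "iso-8859-2"
assert sniff_encoding(b"<html><p>hi") == "windows-1252"
```

The placement matters: because the doctype check comes after the declaration check, a page that declares a legacy encoding keeps it, while an undeclared HTML5 page would stop falling through to the locale-dependent steps.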
Re: [whatwg] Default encoding to UTF-8?
On Fri, Dec 2, 2011 at 6:29 PM, Glenn Maynard gl...@zewt.org wrote: On Fri, Dec 2, 2011 at 10:46 AM, Henri Sivonen hsivo...@iki.fi wrote: Regarding your (and 16) remark, considering my personal happiness at work, I'd prioritize the eradication of UTF-16 as an interchange encoding much higher than eradicating ASCII-based non-UTF-8 encodings that all major browsers support. I think suggesting a solution to the encoding problem while implying that UTF-16 is not a problem isn't particularly appropriate. :-) ... I don't think I'd call it a bigger problem, though, since it's comparatively (even vanishingly) rare, where untagged legacy encodings are a widespread problem that gets worse every day we can't think of a way to curtail it. From an implementation perspective, UTF-16 has its own class of bugs that are unlike other encoding-related bugs, and fixing those bugs is particularly annoying because you know that UTF-16 is so rare that the fix has little actual utility. -- Henri Sivonen hsivo...@iki.fi http://hsivonen.iki.fi/
Re: [whatwg] Default encoding to UTF-8?
On Mon, Dec 5, 2011 at 1:30 AM, Henri Sivonen hsivo...@iki.fi wrote: From an implementation perspective, UTF-16 has its own class of bugs that are unlike other encoding-related bugs, and fixing those bugs is particularly annoying because you know that UTF-16 is so rare that the fix has little actual utility. There are lots of things like that on the platform, though, and this one doesn't really get worse over time. More and more content with untagged legacy encodings accumulates every day, regularly causing user-visible problems, which is why I'd call it a much bigger issue. -- Glenn Maynard
Re: [whatwg] Default encoding to UTF-8?
On Wed, 30 Nov 2011 21:29:31 -0500, L. David Baron dba...@dbaron.org wrote: I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago (by changing the intl.charset.default preference) Just to add: in Opera, you can go to Ctrl + F12 - General tab - Language section - Details and set 'Encoding to assume for pages lacking specification' to utf-8. Or, do it via opera:config#Fallback%20HTML%20Encoding. I tried this years ago, but don't remember if it caused any problems on any web sites I visited. But I quit setting it to utf-8 because I'd forget about it, and it affected some web page encoding test cases where others would get different results on the tests because they had it set at the default (I don't remember the details of the tests). -- Michael
Re: [whatwg] Default encoding to UTF-8?
On Thu, Dec 1, 2011 at 1:28 AM, Faruk Ates faruka...@me.com wrote: My understanding is that all browsers* default to Western Latin (ISO-8859-1) encoding by default (for Western-world downloads/OSes) due to legacy content on the web. As has already been pointed out, the default varies by locale. But how relevant is that still today? It's relevant for supporting the long tail of existing content. The sad part is that the mechanisms that allow existing legacy content to work within each locale silo also make it possible for ill-informed or uncaring authors to develop more locale-siloed content (i.e. content that doesn't declare the encoding and, therefore, only works when the user's fallback encoding is the same as the author's). I'm wondering if it might not be good to start encouraging defaulting to UTF-8, and only fall back to Western Latin if it is detected that the content is very old / served by old infrastructure or servers, etc. And of course if the content is served with an explicit encoding of Western Latin. I think this would be a very bad idea. It would make debugging hard. Moreover, it would be the wrong heuristic, because well-maintained server infrastructure can host a lot of legacy content. Consider any shared hosting situation where the administrator of the server software isn't the content creator. We like to think that “every web developer is surely building things in UTF-8 nowadays” but this is far from true. I still frequently break websites and webapps simply by entering my name (Faruk Ateş). For things to work, the server-side component needs to deal with what gets sent to it. ASCII-oriented authors could still mishandle all non-ASCII even if Web browsers forced them to deal with UTF-8 by sending them UTF-8. Furthermore, your proposed solution wouldn't work for legacy software that correctly declares an encoding but declares a non-UTF-8 encoding. 
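The kind of breakage Faruk describes is easy to reproduce: a form field submitted as UTF-8 is garbled the moment any server-side component decodes it as Windows-1252. A minimal sketch of that round trip (illustrative only; real servers fail in this same way via their default request-decoding settings):

```python
name = "Faruk Ate\u015f"
submitted = name.encode("utf-8")        # bytes the browser sends

# A server that assumes Windows-1252 turns the two UTF-8 bytes of the
# final letter into two unrelated characters, and stores the result.
stored = submitted.decode("windows-1252")
print(stored)  # Faruk Ate\u00c5\u0178  ("AteÅŸ")
```

Note that no exception is raised anywhere: Windows-1252 happily decodes almost any byte, which is exactly why this class of bug survives until a user with a non-ASCII name shows up.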
Sadly, getting sites to deal with your name properly requires the developer of each site to get a clue. :-( Just sending form submissions in UTF-8 isn't enough if the recipient can't deal. Compare with http://krijnhoetmer.nl/irc-logs/whatwg/20110906#l-392 Yes, I understand that that particular issue is something we ought to fix through evangelism, but I think that WHATWG/browser vendors can help with this while at the same time (rightly, smartly) making the case that the web of tomorrow should be a UTF-8 (and 16) based one, not a smorgasbord of different encodings. Anne has worked on speccing what exactly the smorgasbord should be. See http://wiki.whatwg.org/wiki/Web_Encodings . I think it's not realistic to drop encodings that are on the list of encodings you see in the encoding menu on http://validator.nu/?charset However, I think browsers should drop support for encodings that aren't already supported by all the major browsers, because such encodings only serve to enable browser-specific content and encoding proliferation. Regarding your (and 16) remark, considering my personal happiness at work, I'd prioritize the eradication of UTF-16 as an interchange encoding much higher than eradicating ASCII-based non-UTF-8 encodings that all major browsers support. I think suggesting a solution to the encoding problem while implying that UTF-16 is not a problem isn't particularly appropriate. :-) So hence my question whether any vendor has done any recent research in this. Mobile browsers seem to have followed desktop browsers in this; perhaps this topic was tested and researched in recent times as part of that, but I couldn't find any such data. The only real relevant thread of discussion around UTF-8 as a default was this one about Web Workers: http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-September/023197.html …which basically suggested that everyone is hugely in favor of UTF-8 and making it a default wherever possible. So how 'bout it? 
I think in order to comply with the Support Existing Content design principle (even if it unfortunately means that support is siloed by locale), and in order to make plans that are game-theoretically reasonable (not taking steps that make users migrate to browsers that haven't taken the steps), we shouldn't change the fallback encodings from what the HTML5 spec says when it comes to loading text/html or text/plain content into a browsing context. What's going on in this area, if anything? There's the effort to specify a set of encodings and their aliases for browsers to support. That's moving slowly, since Anne has other more important specs to work on. Other than that, there have been efforts to limit new features to UTF-8 only (consider scripts in Workers and App Cache manifests) and efforts to make new features not vary by locale-dependent defaults (consider HTML in XHR). Both these efforts have faced criticism, unfortunately. -- Henri Sivonen hsivo...@iki.fi http://hsivonen.iki.fi/
Re: [whatwg] Default encoding to UTF-8?
On Thu, Dec 1, 2011 at 8:29 PM, Brett Zamir bret...@yahoo.com wrote: How about a Compatibility Mode for the older non-UTF-8 character set approach, specific to the page? That compatibility mode already exists: It's the default mode--just like the quirks mode is the default for pages that don't have a doctype. You opt out of the quirks mode by saying <!DOCTYPE html>. You opt out of the encoding compatibility mode by saying <meta charset=utf-8>. -- Henri Sivonen hsivo...@iki.fi http://hsivonen.iki.fi/
Re: [whatwg] Default encoding to UTF-8?
On Fri, Dec 2, 2011 at 10:46 AM, Henri Sivonen hsivo...@iki.fi wrote: Regarding your (and 16) remark, considering my personal happiness at work, I'd prioritize the eradication of UTF-16 as an interchange encoding much higher than eradicating ASCII-based non-UTF-8 encodings that all major browsers support. I think suggesting a solution to the encoding problem while implying that UTF-16 is not a problem isn't particularly appropriate. :-) UTF-16 is definitely terrible for interchange (it's terrible for internal use, too, but we're stuck with that), and I'm all for anything that prevents its proliferation. I don't think I'd call it a bigger problem, though, since it's comparatively (even vanishingly) rare, whereas untagged legacy encodings are a widespread problem that gets worse every day we can't think of a way to curtail it. I don't have any new ideas for doing that, either, though. I think in order to comply with the Support Existing Content design principle (even if it unfortunately means that support is siloed by locale) and in order to make plans that are game-theoretically reasonable (not taking steps that make users migrate to browsers that haven't taken the steps), we shouldn't change the fallback encodings from what the HTML5 spec says when it comes to loading text/html or text/plain content into a browsing context. And no browser vendor would ever do this, no matter what the spec says, since nobody's willing to break massive swaths of existing content. -- Glenn Maynard
Re: [whatwg] Default encoding to UTF-8?
I have read section 4.2.5.5 of the WHATWG HTML spec and I think it is sufficient. It requires that any non-US-ASCII document have an explicit character encoding declaration. It also recommends UTF-8 for all new documents and for authoring tools' default encoding. Therefore, any document conforming to HTML5 should not pose any problem in this area. The default encoding issue is therefore for old stuff. But I have seen a lot of pages, in browsers and in mail, that were tagged with one encoding and encoded in another. Hence, documents without a charset declaration are only one of the reasons for the garbage we see. Therefore, I see no point in trying to fix anything in browsers by changing the ancient defaults (risking compatibility issues). Energy should go into filing bugs against misbehaving authoring tools and into adding proper recommendations and education in HTML guidelines and tutorials. Thanks, Sergiusz On Thu, Dec 1, 2011 at 7:00 AM, L. David Baron dba...@dbaron.org wrote: On Thursday 2011-12-01 14:37 +0900, Mark Callow wrote: On 01/12/2011 11:29, L. David Baron wrote: The default varies by localization (and within that potentially by platform), and unfortunately that variation does matter. In my experience this is what causes most of the breakage. It leads people to create pages that do not specify the charset encoding. The page works fine in the creator's locale but shows mojibake (garbage characters) for anyone in a different locale. If the default was ASCII everywhere then all authors would see mojibake, unless it really was an ASCII-only page, which would force them to set the charset encoding correctly. Sure, if the default were consistent everywhere we'd be fine. If we have a choice in what that default is, UTF-8 is probably a good choice unless there's some advantage to another one. But nobody's figured out how to get from here to there.
(I think this is legacy from the pre-Unicode days, when the browser simply displayed Web pages using the system character set, which led to a legacy of incompatible Web pages in different parts of the world.) -David -- 𝄞 L. David Baron http://dbaron.org/ 𝄂 𝄢 Mozilla http://www.mozilla.org/ 𝄂
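Sergiusz's observation that many pages are tagged with one encoding but actually encoded in another can be made concrete with a short Python sketch. This is an illustrative helper, not anything from the thread; it uses the fact that a strict decode under the declared label fails for bytes that don't belong to that encoding:

```python
# Sketch of detecting the mislabeling Sergiusz describes: bytes that do not
# survive a strict decode under their declared charset label.
# (Helper name and sample data are illustrative, not from the thread.)

def matches_declared_encoding(data: bytes, declared: str) -> bool:
    """Return True if the bytes decode cleanly under the declared charset."""
    try:
        data.decode(declared)
        return True
    except (UnicodeDecodeError, LookupError):
        return False

# A page whose bytes are really ISO-8859-1 ("é" is the single byte 0xE9).
legacy_bytes = "café".encode("iso-8859-1")

# Correctly labeled: decodes fine.
assert matches_declared_encoding(legacy_bytes, "iso-8859-1")

# Mislabeled as UTF-8: a lone 0xE9 is not a valid UTF-8 byte sequence.
assert not matches_declared_encoding(legacy_bytes, "utf-8")
```

Note the asymmetry: the check catches legacy bytes mislabeled as UTF-8, but not the reverse, because ISO-8859-1 assigns a character to every byte value, so UTF-8 bytes mislabeled as ISO-8859-1 "decode" without error and simply come out as mojibake.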
Re: [whatwg] Default encoding to UTF-8?
On 12/1/2011 2:00 PM, L. David Baron wrote: On Thursday 2011-12-01 14:37 +0900, Mark Callow wrote: On 01/12/2011 11:29, L. David Baron wrote: The default varies by localization (and within that potentially by platform), and unfortunately that variation does matter. In my experience this is what causes most of the breakage. It leads people to create pages that do not specify the charset encoding. The page works fine in the creator's locale but shows mojibake (garbage characters) for anyone in a different locale. If the default was ASCII everywhere then all authors would see mojibake, unless it really was an ASCII-only page, which would force them to set the charset encoding correctly. Sure, if the default were consistent everywhere we'd be fine. If we have a choice in what that default is, UTF-8 is probably a good choice unless there's some advantage to another one. But nobody's figured out how to get from here to there. How about a Compatibility Mode for the older non-UTF-8 character set approach, specific to page? I wholeheartedly agree that something should be done here, preventing yet more content from piling up in outdated ways without any consequences. (Same with email clients too, I would hope as well.) Brett
Re: [whatwg] Default encoding to UTF-8?
2011-12-01 1:28, Faruk Ates wrote: My understanding is that all browsers* default to Western Latin (ISO-8859-1) encoding by default (for Western-world downloads/OSes) due to legacy content on the web. Browsers default to various encodings, often windows-1252 (rather than ISO-8859-1). They may also investigate the actual data and make a guess based on it. I'm wondering if it might not be good to start encouraging defaulting to UTF-8, It would not. There’s no reason to recommend any particular defaulting, especially not something that deviates from past practices. It might be argued that browsers should do better error detection and reporting, so that they inform the user e.g. if the document’s encoding has not been declared at all and it cannot be inferred fairly reliably (e.g., from BOM). But I’m afraid the general feeling is that browsers should avoid warning users, as that tends to contradict authors’ purposes – and, in fact, mostly things that are serious problems in principle aren’t that serious in practice. We like to think that “every web developer is surely building things in UTF-8 nowadays” but this is far from true. There’s a large amount of pages declared as UTF-8 but containing Ascii only, as well as pages mislabeled as UTF-8 but containing e.g. ISO-8859-1. I still frequently break websites and webapps simply by entering my name (Faruk Ateş). That’s because the server-side software (and possibly client-side software) cannot handle the letter “ş”. It would not help if the page were interpreted as UTF-8. If the author knows that a server-side form Yucca
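Yucca's aside that browsers default to windows-1252 rather than ISO-8859-1 matters because the two differ in the 0x80-0x9F range. A quick Python check (illustrative, not from the thread) shows the difference:

```python
# windows-1252 and ISO-8859-1 agree everywhere except bytes 0x80-0x9F, where
# ISO-8859-1 has invisible C1 control characters and windows-1252 has
# printable punctuation. Browsers treat content labeled ISO-8859-1 as
# windows-1252 for exactly this reason. (Illustrative check.)

smart_quote = b"\x93"

# windows-1252 maps 0x93 to a left double quotation mark...
assert smart_quote.decode("windows-1252") == "\u201c"

# ...while ISO-8859-1 maps it to the C1 control character U+0093.
assert smart_quote.decode("iso-8859-1") == "\x93"

# Outside that range the two encodings agree byte for byte.
assert b"\xe9".decode("windows-1252") == b"\xe9".decode("iso-8859-1") == "é"
```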
Re: [whatwg] Default encoding to UTF-8?
On Wednesday 2011-11-30 15:28 -0800, Faruk Ates wrote: My understanding is that all browsers* default to Western Latin (ISO-8859-1) encoding by default (for Western-world downloads/OSes) due to legacy content on the web. But how relevant is that still today? Has any browser done any recent research into the need for this? The default varies by localization (and within that potentially by platform), and unfortunately that variation does matter. You can see Firefox's defaults here: http://mxr.mozilla.org/l10n-mozilla-beta/search?string=intl.charset.default (The localization and platform are part of the filename.) I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago (by changing the intl.charset.default preference), and I do see a decent amount of broken content as a result (maybe I encounter a new broken page once a week? -- though substantially more often if I'm looking at non-English pages because of travel). I'm wondering if it might not be good to start encouraging defaulting to UTF-8, and only fallback to Western Latin if it is detected that the content is very old / served by old infrastructure or servers, etc. And of course if the content is served with an explicit encoding of Western Latin. The more complex the rules, the harder they are for authors to understand / debug. I wouldn't want to create rules like those. I would, however, like to see movement towards defaulting to UTF-8: the current situation makes the Web less world-wide because pages that work for one user don't work for another. I'm just not quite sure how to get from here to there, though, since such changes are likely to make users experience broken content. -David -- 𝄞 L. David Baron http://dbaron.org/ 𝄂 𝄢 Mozilla http://www.mozilla.org/ 𝄂
Re: [whatwg] Default encoding to UTF-8?
On 01/12/2011 11:29, L. David Baron wrote: The default varies by localization (and within that potentially by platform), and unfortunately that variation does matter. In my experience this is what causes most of the breakage. It leads people to create pages that do not specify the charset encoding. The page works fine in the creator's locale but shows mojibake (garbage characters) for anyone in a different locale. If the default was ASCII everywhere then all authors would see mojibake, unless it really was an ASCII-only page, which would force them to set the charset encoding correctly. Regards -Mark
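The mojibake Mark describes is mechanical: UTF-8 bytes read under a legacy single-byte default turn every multi-byte sequence into two or more wrong characters. A small Python sketch (illustrative; the letter is the one from Faruk's name earlier in the thread):

```python
# Mojibake, mechanically: a UTF-8 page read with a windows-1252 locale
# default turns every multi-byte sequence into wrong characters.
# (Illustrative sketch, not from the thread.)

original = "Ateş"
utf8_bytes = original.encode("utf-8")  # 'ş' (U+015F) becomes the bytes C5 9F

# A reader whose locale default is windows-1252 sees:
garbled = utf8_bytes.decode("windows-1252")
assert garbled == "AteÅŸ"  # 0xC5 -> 'Å', 0x9F -> 'Ÿ'

# The round trip shows why the damage is recoverable in principle but never
# undone in practice: re-encoding the garbled text as windows-1252 yields
# the original UTF-8 bytes unchanged.
assert garbled.encode("windows-1252") == utf8_bytes
```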
Re: [whatwg] Default encoding to UTF-8?
On Thursday 2011-12-01 14:37 +0900, Mark Callow wrote: On 01/12/2011 11:29, L. David Baron wrote: The default varies by localization (and within that potentially by platform), and unfortunately that variation does matter. In my experience this is what causes most of the breakage. It leads people to create pages that do not specify the charset encoding. The page works fine in the creator's locale but shows mojibake (garbage characters) for anyone in a different locale. If the default was ASCII everywhere then all authors would see mojibake, unless it really was an ASCII-only page, which would force them to set the charset encoding correctly. Sure, if the default were consistent everywhere we'd be fine. If we have a choice in what that default is, UTF-8 is probably a good choice unless there's some advantage to another one. But nobody's figured out how to get from here to there. (I think this is legacy from the pre-Unicode days, when the browser simply displayed Web pages using the system character set, which led to a legacy of incompatible Web pages in different parts of the world.) -David -- 𝄞 L. David Baron http://dbaron.org/ 𝄂 𝄢 Mozilla http://www.mozilla.org/ 𝄂