Re: [whatwg] Default encoding to UTF-8?
L. David Baron on Wed Nov 30 18:29:31 PST 2011: On Wednesday 2011-11-30 15:28 -0800, Faruk Ates wrote: My understanding is that all browsers* default to Western Latin (ISO-8859-1) encoding by default (for Western-world downloads/OSes) due to legacy content on the web. But how relevant is that still today? Has any browser done any recent research into the need for this?

The default varies by localization (and within that potentially by platform), and unfortunately that variation does matter. You can see Firefox's defaults here: http://mxr.mozilla.org/l10n-mozilla-beta/search?string=intl.charset.default (The localization and platform are part of the filename.) Last I checked, some of those locales defaulted to UTF-8. (And HTML5 defines it the same.)

So how is that possible? Don't users of those locales travel as much as you do? Or do we consider English locale users as more important? Something is broken in the logic here!

I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago (by changing the intl.charset.default preference), and I do see a decent amount of broken content as a result (maybe I encounter a new broken page once a week? -- though substantially more often if I'm looking at non-English pages because of travel).

What kind of trouble are you actually describing here? You are describing a problem with using UTF-8 for *your locale*. What is your locale? It is probably English. Or do you consider your locale to be 'the Western world locale'? It sounds like *that* is what Anne has in mind when he brings in Dutch: http://blog.whatwg.org/weekly-encoding-woes (Quite often it sounds as if some see Latin-1 - or Windows-1252 as we now should say - as a 'super default' rather than a locale default. If that is the case, that it is a super default, then we should also spec it like that! Until further notice, I'll treat Latin-1 as it is specced: as a default for certain locales.)
Since it is a locale problem, we need to understand which locale you have - and/or which locale you - and other debaters - think they have. Faruk probably uses a Spanish locale - right? - so the two of you are not speaking out of the same context. However, you also say that your problem is not so much related to pages written for *your* locale as it is to pages written for users of *other* locales. So how many times per year do Dutch, Spanish or Norwegian - and other non-English - pages create trouble for you, as an English locale user? I am making an assumption: almost never. You don't read those languages, do you? This is also an expectation thing: if you visit a Russian page in a legacy Cyrillic encoding, and get mojibake because your browser defaults to Latin-1, then what does it matter to you whether your browser defaults to Latin-1 or UTF-8? Answer: nothing.

I'm wondering if it might not be good to start encouraging defaulting to UTF-8, and only fall back to Western Latin if it is detected that the content is very old / served by old infrastructure or servers, etc. And of course if the content is served with an explicit encoding of Western Latin.

The more complex the rules, the harder they are for authors to understand / debug. I wouldn't want to create rules like those.

Agree that that particular idea is probably not the best. I would, however, like to see movement towards defaulting to UTF-8: the current situation makes the Web less world-wide because pages that work for one user don't work for another. I'm just not quite sure how to get from here to there, though, since such changes are likely to make users experience broken content.

I think we should 'attack' the dominating locale first: the English locale, in its different incarnations (Australian, American, UK). Thus, we should turn things on their head: English users should start to expect UTF-8 to be used.
Because, as English users, you are more used to 'mojibake' than the rest of us are: whenever you see it, you 'know' that it is because it is a foreign language you are reading. It is we, the users of non-English locales, who need the default-to-legacy-encoding behavior the most. Or, please, explain to us when and where it is important that English language users, living in their own native lands so to speak, need their browsers to default to Latin-1 so that they can correctly read English language pages?

If the English locales start defaulting to UTF-8, then little by little the same expectation will start spreading to the other locales as well, not least because the 'geeks' of each locale will tend to see the English locale as a super default - and they might also use the US English locale of their OS and/or browser. We should not consider the needs of geeks - they will follow (read: lead) the way, so the fact that *they* may see mojibake should not be a concern. See? We would have a plan. Or what do you think? Of course, we - or rather: the
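The 'Russian page viewed with a Latin-1 default' scenario above is easy to make concrete. The following Python sketch is illustrative only (the sample word and the choice of KOI8-R are assumptions, not from the thread): the legacy Cyrillic bytes come out as garbage under a Windows-1252 default, and a UTF-8 default fails on the same bytes too, just differently - which is exactly the point that the default encoding does not matter for such pages.

```python
# Illustrative: a Russian page served in the legacy KOI8-R encoding.
text = "привет"                        # "hello" in Russian
raw = text.encode("koi8-r")            # bytes as served by the page

# Viewed with a Windows-1252 (Latin-1) fallback default: mojibake.
latin1_view = raw.decode("windows-1252")
print(latin1_view)                     # Latin letters, not Cyrillic

# Viewed with a UTF-8 fallback default: not valid UTF-8 either.
utf8_view = raw.decode("utf-8", errors="replace")
print(utf8_view)                       # replacement characters
```

Either default yields unreadable text, so a reader of this page gains nothing from the Latin-1 fallback.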
Re: [whatwg] Default encoding to UTF-8?
(And HTML5 defines it the same.) No. As far as I understand, HTML5 defines US-ASCII to be the default and requires that any other encoding is explicitly declared. I do like this approach. We should also lobby for authoring tools (as recommended by HTML5) to default their output to UTF-8 and make sure the encoding is declared. As so many pages, supposedly (I have not researched this), use the incorrect encoding, it makes no sense to try to clean up this mess by messing with existing defaults. It may fix some pages and break others. Browsers have the ability to override an incorrect encoding, and this is a reasonable workaround. -- Sergiusz
Re: [whatwg] Default encoding to UTF-8?
(And HTML5 defines it the same.) No. As far as I understand, HTML5 defines US-ASCII to be the default and requires that any other encoding is explicitly declared. I do like this approach.

We are here discussing the default *user agent behaviour* - we are not specifically discussing how web pages should be authored. For user agents, please be aware that HTML5 maintains a table of 'suggested default encodings': http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding When you say 'requires': of course, HTML5 recommends that you declare the encoding (via HTTP/a higher protocol, via the BOM 'sideshow' or via <meta charset=UTF-8>). I just now also discovered that Validator.nu issues an error message if it does not find any of those *and* the document contains non-ASCII. (I don't know, however, whether this error message is just something Henri added at his own discretion - it would be nice to have it literally in the spec too.) (The problem is of course that many English pages expect the whole Unicode alphabet even if they only contain US-ASCII from the start.) HTML5 says that validators *may* issue a warning if UTF-8 is *not* the encoding. But so far, validator.nu has not picked that up.

We should also lobby for authoring tools (as recommended by HTML5) to default their output to UTF-8 and make sure the encoding is declared. HTML5 already says: 'Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]' http://dev.w3.org/html5/spec/semantics.html#charset

As so many pages, supposedly (I have not researched this), use the incorrect encoding, it makes no sense to try to clean up this mess by messing with existing defaults. It may fix some pages and break others. Browsers have the ability to override an incorrect encoding, and this is a reasonable workaround.

Do you use an English locale computer? If you do, without being a native English speaker, then you are some kind of geek ...
Why can't you work around the troubles - as you are used to anyway? Starting a switch to UTF-8 as the default UA encoding for English locale users should *only* affect how English locale users experience languages which *both* need non-ASCII *and* historically have been using Windows-1252 as the default encoding *and* which additionally do not include any encoding declaration. -- Leif Halvard Silli
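The Validator.nu behavior described above - an error when a document contains non-ASCII but declares no encoding - can be sketched as a simple check. This is an illustrative Python sketch under stated assumptions, not Validator.nu's actual logic (a real checker would parse the markup rather than scan the first kilobyte for the word "charset"):

```python
def missing_encoding_declaration(doc_bytes):
    """Sketch of a Validator.nu-style check (illustrative only):
    flag a document that contains non-ASCII bytes but declares no
    encoding via a BOM or an in-document charset declaration."""
    has_non_ascii = any(b > 0x7F for b in doc_bytes)
    has_bom = doc_bytes.startswith(
        (b"\xef\xbb\xbf", b"\xff\xfe", b"\xfe\xff"))
    declared = b"charset" in doc_bytes[:1024].lower()
    return has_non_ascii and not (has_bom or declared)
```

Under this check, an ASCII-only page needs no declaration (matching the point that US-ASCII-only English pages are unaffected), while a page with non-ASCII bytes and no declaration is flagged.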
[whatwg] object, type, and fallback
I can't find a definitive answer for the following scenario:

1 - A page has a plug-in with fallback specified as follows:

<object type="application/x-shockwave-flash">
  <param name="movie" value="Example.swf"/>
  <img src="Fallback.png">
</object>

2 - The page is loaded, the browser instantiates the plug-in, and the plug-in content is shown.

3 - A script later comes along and dynamically changes the object's type attribute to application/some-unsupported-type

Should the browser dynamically and immediately switch from the plug-in to the fallback image? If not, what should it do? And is this specified anywhere? Thanks, ~Brady
Re: [whatwg] Default encoding to UTF-8?
On 12/5/11 12:42 PM, Leif Halvard Silli wrote: Last I checked, some of those locales defaulted to UTF-8. (And HTML5 defines it the same.) So how is that possible? Because authors authoring pages that users of those locales tend to use, use UTF-8 more than anything else?

Don't users of those locales travel as much as you do? People on average travel less than David does, yes. In all locales. But that's not the point. I think you completely misunderstood his comments about travel and locales. Keep reading.

What kind of trouble are you actually describing here? You are describing a problem with using UTF-8 for *your locale*. No. He's describing a problem using UTF-8 to view pages that are not written in English. Now what language are the non-English pages you look at written in? Well, it depends. In western Europe they tend to be in languages that can be encoded in ISO-8859-1, so authors sometimes use that encoding (without even realizing it). If you set your browser to default to UTF-8, those pages will be broken. In Japan, a number of pages are authored in Shift_JIS. Those will similarly be broken in a browser defaulting to UTF-8.

What is your locale? Why does it matter? David's default locale is almost certainly en-US, which defaults to ISO-8859-1 (or whatever Windows-??? encoding that actually means on the web) in his browser. But again, he's changed the default encoding from the locale default, so the locale is irrelevant.

(Quite often it sounds as if some see Latin-1 - or Windows-1252 as we now should say - as a 'super default' rather than a locale default. If that is the case, that it is a super default, then we should also spec it like that! Until further notice, I'll treat Latin-1 as it is specced: as a default for certain locales.) That's exactly what it is.

Since it is a locale problem, we need to understand which locale you have - and/or which locale you - and other debaters - think they have. Again, doesn't matter if you change your settings from the default.
However, you also say that your problem is not so much related to pages written for *your* locale as it is to pages written for users of *other* locales. So how many times per year do Dutch, Spanish or Norwegian - and other non-English - pages create trouble for you, as an English locale user? I am making an assumption: almost never. You don't read those languages, do you? Did you miss the travel part? Want to look up web pages for museums, airports, etc. in a non-English-speaking country? There's a good chance they're not in English!

This is also an expectation thing: if you visit a Russian page in a legacy Cyrillic encoding, and get mojibake because your browser defaults to Latin-1, then what does it matter to you whether your browser defaults to Latin-1 or UTF-8? Answer: nothing. Yes. So?

I think we should 'attack' the dominating locale first: the English locale, in its different incarnations (Australian, American, UK). Thus, we should turn things on their head: English users should start to expect UTF-8 to be used. Because, as English users, you are more used to 'mojibake' than the rest of us are: whenever you see it, you 'know' that it is because it is a foreign language you are reading. Modulo smart quotes (and recently Unicode ellipsis characters). These are actually pretty common in English text on the web nowadays, and have a tendency to be in ISO-8859-1.

Or, please, explain to us when and where it is important that English language users living in their own, native lands so to speak, need that their browser default to Latin-1 so that they can correctly read English language pages? See above.

See? We would have a plan. Or what do you think? Try it in your browser. When I set UTF-8 as my default, there were broken quotation marks all over the web for me. And I'm talking pages in English. -Boris
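Boris's smart-quote point is easy to reproduce. A short Python sketch (the sample string is an illustration, not taken from the thread): curly quotes stored as Windows-1252 bytes are invalid byte sequences under UTF-8, so a UTF-8-defaulting browser shows replacement characters in otherwise-English text.

```python
# Curly quotes as produced by word processors, stored as Windows-1252.
text = "\u201chello\u201d"             # “hello”
raw = text.encode("windows-1252")      # b'\x93hello\x94'

# Read back under a UTF-8 default: 0x93/0x94 are invalid UTF-8 here.
broken = raw.decode("utf-8", errors="replace")
print(broken)                          # replacement chars around 'hello'
```

This is why an English-locale switch to a UTF-8 default is not entirely free: it breaks undeclared Windows-1252 pages that are ASCII except for punctuation.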
Re: [whatwg] Default encoding to UTF-8?
Boris Zbarsky Mon Dec 5 13:49:45 PST 2011: On 12/5/11 12:42 PM, Leif Halvard Silli wrote: Last I checked, some of those locales defaulted to UTF-8. (And HTML5 defines it the same.) So how is that possible? Because authors authoring pages that users of those locales tend to use, use UTF-8 more than anything else? It is more likely that there is another reason, IMHO: they may have tried it, and found that it worked OK. But they of course have the same need for reading non-English museum and railway pages as Mozilla employees.

Don't users of those locales travel as much as you do? I think you completely misunderstood his comments about travel and locales. Keep reading. I'm pretty sure I haven't misunderstood very much.

What kind of trouble are you actually describing here? You are describing a problem with using UTF-8 for *your locale*. No. He's describing a problem using UTF-8 to view pages that are not written in English. And why is that a problem in those cases when it is a problem? Does he read those languages, anyway? Don't we expect some problems when we tread outside our borders?

Now what language are the non-English pages you look at written in? Well, it depends. In western Europe they tend to be in languages that can be encoded in ISO-8859-1, so authors sometimes use that encoding (without even realizing it). If you set your browser to default to UTF-8, those pages will be broken. In Japan, a number of pages are authored in Shift_JIS. Those will similarly be broken in a browser defaulting to UTF-8. The solution I proposed was that English locale browsers should default to UTF-8. Of course, such users could then get problems when in Japan - on some Japanese pages - which is a small nuisance, especially if they read Japanese.

What is your locale? Why does it matter? David's default locale is almost certainly en-US, which defaults to ISO-8859-1 (or whatever Windows-??? encoding that actually means on the web) in his browser.
But again, he's changed the default encoding from the locale default, so the locale is irrelevant. The locale is meant to predominantly be used within a physical locale. If he is at another physical locale - or, virtually, at another locale - he should not expect that things work out of the box unless a common encoding is used. Even today, if he visits Japan, he has to either change his browser settings *or* rely on the pages declaring their encodings. So nothing would change, for him, when visiting Japan - with his browser or with his computer. Yes, there would be a change w.r.t. English quotation marks (see below) and w.r.t. visiting Western European language pages: for those, a number of pages which don't fail with Windows-1252 as the default would start to fail. But relatively speaking, it is less important that non-English pages fail for the English locale.

(Quite often it sounds as if some see Latin-1 - or Windows-1252 as we now should say - as a 'super default' rather than a locale default. If that is the case, that it is a super default, then we should also spec it like that! Until further notice, I'll treat Latin-1 as it is specced: as a default for certain locales.) That's exactly what it is. A default for certain locales? Right.

Since it is a locale problem, we need to understand which locale you have - and/or which locale you - and other debaters - think they have. Again, doesn't matter if you change your settings from the default. I don't think I have misunderstood anything.

However, you also say that your problem is not so much related to pages written for *your* locale as it is to pages written for users of *other* locales. So how many times per year do Dutch, Spanish or Norwegian - and other non-English - pages create trouble for you, as an English locale user? I am making an assumption: almost never. You don't read those languages, do you? Did you miss the travel part?
Want to look up web pages for museums, airports, etc. in a non-English-speaking country? There's a good chance they're not in English! There is a very good chance, also, that only very few of the web pages of such professional institutions would fail to declare their encoding.

This is also an expectation thing: if you visit a Russian page in a legacy Cyrillic encoding, and get mojibake because your browser defaults to Latin-1, then what does it matter to you whether your browser defaults to Latin-1 or UTF-8? Answer: nothing. Yes. So? So we can look away from Greek, Cyrillic, Japanese, Chinese etc. in this debate. In the end, the only benefit for English locale users of keeping Windows-1252 as the default is that they can have slightly fewer problems when visiting Western European language web pages with their computers. (Yes, I saw that you mention smart quotes etc. below - so there is that reason too.) I think we should 'attack' the dominating
Re: [whatwg] Default encoding to UTF-8?
On Fri, 02 Dec 2011 15:50:31 -, Henri Sivonen hsivo...@iki.fi wrote: That compatibility mode already exists: it's the default mode - just like the quirks mode is the default for pages that don't have a doctype. You opt out of the quirks mode by saying <!DOCTYPE html>. You opt out of the encoding compatibility mode by saying <meta charset=utf-8>.

Could <!DOCTYPE html> be an opt-in to default UTF-8 encoding? It would be nice to minimize the number of declarations a page needs to include. -- regards, Kornel Lesiński
Re: [whatwg] Default encoding to UTF-8?
On Dec 5, 2011, at 4:10 PM, Kornel Lesiński wrote: Could <!DOCTYPE html> be an opt-in to default UTF-8 encoding? It would be nice to minimize the number of declarations a page needs to include. I like that idea. Maybe it’s not too late. -- Darin
Re: [whatwg] Default encoding to UTF-8?
On 12/5/11 6:14 PM, Leif Halvard Silli wrote: It is more likely that there is another reason, IMHO: they may have tried it, and found that it worked OK. Where by 'it' you mean open a text editor, type some text, and save. So they get whatever encoding their OS and editor defaults to. And yes, then they find that it works OK, so they don't worry about encodings.

No. He's describing a problem using UTF-8 to view pages that are not written in English. And why is that a problem in those cases when it is a problem? Because the characters are wrong? Does he read those languages, anyway? Do you read English? Seriously, what are you asking there, exactly? (For the record, reading a particular page in a language is a much simpler task than reading the language; I can't read German, but I can certainly read a German subway map.)

The solution I proposed was that English locale browsers should default to UTF-8. I know the solution you proposed. That solution tries to avoid the issues David was describing by only breaking things for people in English browser locales. I understand that.

Why does it matter? David's default locale is almost certainly en-US, which defaults to ISO-8859-1 (or whatever Windows-??? encoding that actually means on the web) in his browser. But again, he's changed the default encoding from the locale default, so the locale is irrelevant. The locale is meant to predominantly be used within a physical locale. Yes, so? If he is at another physical locale - or, virtually, at another locale - he should not expect that things work out of the box unless a common encoding is used. He was responding to a suggestion that the default encoding be changed to UTF-8 for all locales. Are you _really_ sure you understood the point of his mail?

Even today, if he visits Japan, he has to either change his browser settings *or* rely on the pages declaring their encodings. So nothing would change, for him, when visiting Japan - with his browser or with his computer.
He wasn't saying it's a problem for him per se. He's a somewhat sophisticated browser user who knows how to change the encoding for a particular page. What he was saying is that there are lots of pages out there that aren't encoded in UTF-8 and rely on locale fallbacks to particular encodings, and that he's run into them a bunch while traveling in particular, so they were not pages in English. So far, you and he seem to agree.

Yes, there would be a change w.r.t. English quotation marks (see below) and w.r.t. visiting Western European language pages: for those, a number of pages which don't fail with Windows-1252 as the default would start to fail. But relatively speaking, it is less important that non-English pages fail for the English locale. No one is worried about that, particularly.

There is a very good chance, also, that only very few of the web pages of such professional institutions would fail to declare their encoding. You'd be surprised.

Modulo smart quotes (and recently Unicode ellipsis characters). These are actually pretty common in English text on the web nowadays, and have a tendency to be in ISO-8859-1. If we change the default, they will start to tend to be in UTF-8. Not unless we change the authoring tools. Half the time these things are just directly exported from a word processor.

OK: quotation marks. However, in 'old web pages' you also find much more use of HTML entities (such as &ldquo;) than you find today. We should take advantage of that, no? I have no idea what you're trying to say.

When you mention quotation marks, you mention a real locale-related issue. And maybe the Euro sign too? Not an issue for me personally, but it could be for some, yes. Nevertheless, the problem is smallest for languages that primarily limit their alphabet to the letters present in US-ASCII. Sure. It may still be too big.
It would be logical, thus, to start the switch to UTF-8 with those locales. If we start at all. Perhaps we need to have a project to measure these problems, instead of all these anecdotes? Sure. More data is always better than anecdotes. -Boris
Re: [whatwg] Default encoding to UTF-8?
On 12/5/11 6:14 PM, Leif Halvard Silli wrote: It is more likely that there is another reason, IMHO: they may have tried it, and found that it worked OK. Where by 'it' you mean open a text editor, type some text, and save. So they get whatever encoding their OS and editor defaults to. If that is all they tested, then I'd say they did not test enough. And yes, then they find that it works OK, so they don't worry about encodings. Ditto.

No. He's describing a problem using UTF-8 to view pages that are not written in English. And why is that a problem in those cases when it is a problem? Because the characters are wrong? But the characters will be wrong many more times than exactly those times when he tries to read a web page in a Western European language that is not declared as Windows-1252. Do English locale users have particular expectations with regard to exactly those web pages? What about Polish web pages etc.? English locale users are a very multiethnic lot.

Does he read those languages, anyway? Do you read English? Seriously, what are you asking there, exactly? Because if it is an issue, then it is about expectations for exactly those pages. (Plus the quote problem, of course.) (For the record, reading a particular page in a language is a much simpler task than reading the language; I can't read German, but I can certainly read a German subway map.) Or a Polish subway map - which doesn't default to said encoding.

The solution I proposed was that English locale browsers should default to UTF-8. I know the solution you proposed. That solution tries to avoid the issues David was describing by only breaking things for people in English browser locales. I understand that. That characterization is only true with regard to the quote problem. That German pages break would not be any more important than the fact that Polish pages would. For that matter: it happens that UTF-8 pages break as well. I only suggest it as a first step, so to speak.
Or rather - since some locales apparently already default to UTF-8 - as a next step. Thereafter, more locales would be expected to follow suit - as the development of each locale permits.

Why does it matter? David's default locale is almost certainly en-US, which defaults to ISO-8859-1 (or whatever Windows-??? encoding that actually means on the web) in his browser. But again, he's changed the default encoding from the locale default, so the locale is irrelevant. The locale is meant to predominantly be used within a physical locale. Yes, so? So then we have a set of expectations for the language of that locale. If we look at how the locale settings handle other languages, then we are outside the issue that the locale-specific encodings are supposed to handle.

If he is at another physical locale - or, virtually, at another locale - he should not expect that things work out of the box unless a common encoding is used. He was responding to a suggestion that the default encoding be changed to UTF-8 for all locales. Are you _really_ sure you understood the point of his mail? I said I agreed with him that Faruk's solution was not good. However, I would not be against treating <!DOCTYPE html> as a 'default to UTF-8' declaration, as suggested by some - if it were possible to agree about that. Then we could keep things as they are, except for the HTML5 DOCTYPE. I guess the HTML5 doctype would become 'the default before the default': if everything else fails, then UTF-8 if the DOCTYPE is <!DOCTYPE html>, or else the locale default. It sounded like Darin Adler thinks it possible. How about you?

Even today, if he visits Japan, he has to either change his browser settings *or* rely on the pages declaring their encodings. So nothing would change, for him, when visiting Japan - with his browser or with his computer. He wasn't saying it's a problem for him per se. He's a somewhat sophisticated browser user who knows how to change the encoding for a particular page.
If we are talking about an English locale user visiting Japan, then I doubt a change in the default encoding would matter - Windows-1252 as default would be wrong anyway.

What he was saying is that there are lots of pages out there that aren't encoded in UTF-8 and rely on locale fallbacks to particular encodings, and that he's run into them a bunch while traveling in particular, so they were not pages in English. So far, you and he seem to agree. So far we agree, yes.

Yes, there would be a change w.r.t. English quotation marks (see below) and w.r.t. visiting Western European language pages: for those, a number of pages which don't fail with Windows-1252 as the default would start to fail. But relatively speaking, it is less important that non-English pages fail for the English locale. No one is worried about that, particularly. You spoke about visiting German pages above - sounded like you worried, but
Re: [whatwg] Default encoding to UTF-8?
On 12/5/11 9:55 PM, Leif Halvard Silli wrote: If that is all they tested, then I'd say they did not test enough. That's normal for the web.

(For the record, reading a particular page in a language is a much simpler task than reading the language; I can't read German, but I can certainly read a German subway map.) Or a Polish subway map - which doesn't default to said encoding. Indeed. I don't think anyone thinks the existing situation is all fine or anything.

I said I agreed with him that Faruk's solution was not good. However, I would not be against treating <!DOCTYPE html> as a 'default to UTF-8' declaration. This might work, if there hasn't been too much cargo-culting yet. Data urgently needed!

Not unless we change the authoring tools. Half the time these things are just directly exported from a word processor. Please educate me. I'm perhaps 'handicapped' in that regard: I haven't used MS Word on a regular basis since MS Word 5.1 for Mac. Also, if export means copy and paste... It can mean that, or save as HTML followed by copy and paste. ...then on the Mac, everything gets converted via the clipboard. On Mac, the default OS encoding is UTF-8 last I checked. That's decidedly not the case on Windows.

OK: quotation marks. However, in 'old web pages' you also find much more use of HTML entities (such as &ldquo;) than you find today. We should take advantage of that, no? I have no idea what you're trying to say. Sorry. What I meant was that character entities are encoding-independent. Yes. And that lots of people - and authoring tools - have inserted non-ASCII letters and characters as character entities. Sure. And lots have inserted them directly. At any rate: a page which uses character entities for non-ASCII would render the same regardless of encoding, hence a switch to UTF-8 would not matter for those. Sure. We're not worried about such pages here. -Boris
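Leif's point that character entities are encoding-independent can be checked directly. A small Python sketch (the markup string is an illustration): a page whose non-ASCII characters are all written as entities is pure ASCII on the wire, so every ASCII-compatible fallback default decodes it to the same text.

```python
import html

# Curly quotes written as entities: the document bytes are pure ASCII.
markup = "<p>&ldquo;quoted&rdquo;</p>"
raw = markup.encode("ascii")

# Decode the same bytes under three different fallback defaults.
views = {raw.decode(enc) for enc in ("utf-8", "windows-1252", "iso-8859-1")}
print(len(views))                      # 1: identical under every default
print(html.unescape(markup))           # the rendered curly quotes
```

This is why such pages are unaffected by any change in the default encoding.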
Re: [whatwg] Default encoding to UTF-8?
Boris Zbarsky Mon Dec 5 19:18:10 PST 2011: On 12/5/11 9:55 PM, Leif Halvard Silli wrote: I said I agreed with him that Faruk's solution was not good. However, I would not be against treating DOCTYPE html as a 'default to UTF-8' declaration This might work, if there hasn't been too much cargo-culting yet. Data urgently needed! Yeah, it would be a pity if it had already become a widespread cargo cult to - all at once - use the HTML5 doctype without using UTF-8 *and* without using some encoding declaration *and* thus effectively relying on the default locale encoding ... Who has a data corpus? Henri, as the Validator.nu developer? This change would involve adding one more step to the HTML5 parser's encoding sniffing algorithm. [1] The question then is when, upon seeing the HTML5 doctype, the defaulting to UTF-8 ought to happen in order to be useful. It seems it would have to happen after the processing of the explicit meta data (steps 1 to 5) but before the last 3 steps - steps 6, 7 and 8: Step 6: 'if the user agent has information on the likely encoding' Step 7: UA 'may attempt to autodetect the character encoding' Step 8: 'implementation-defined or user-specified default' The role of the HTML5 DOCTYPE, encoding-wise, would then be to ensure that steps 6 to 8 do not happen. [1] http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm -- Leif H Silli
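The proposed ordering can be sketched as a simplified model of the sniffing algorithm - not the spec's actual prose, and the parameter names are invented for illustration. The key property is that the DOCTYPE check sits after explicit declarations (steps 1 to 5, collapsed here) but short-circuits steps 6 to 8:

```python
def sniff_encoding(explicit_bom=None, meta_charset=None,
                   has_html5_doctype=False, likely_encoding=None,
                   autodetected=None, locale_default="windows-1252"):
    # Steps 1-5 (collapsed): explicit information always wins.
    if explicit_bom:
        return explicit_bom
    if meta_charset:
        return meta_charset
    # Proposed new step: <!DOCTYPE html> implies UTF-8,
    # so steps 6-8 never run for such pages.
    if has_html5_doctype:
        return "utf-8"
    # Step 6: UA has information on the likely encoding.
    if likely_encoding:
        return likely_encoding
    # Step 7: autodetection.
    if autodetected:
        return autodetected
    # Step 8: implementation-defined or user-specified default.
    return locale_default
```

Under this model an HTML5-doctype page with an explicit meta charset keeps its declared encoding, while one without any declaration gets UTF-8 instead of the locale default.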
Re: [whatwg] object, type, and fallback
On Mon, 05 Dec 2011 22:19:33 +0100, Brady Eidson beid...@apple.com wrote: I can't find a definitive answer for the following scenario: 1 - A page has a plug-in with fallback specified as follows: object type=application/x-shockwave-flash param name=movie value=Example.swf/ img src=Fallback.png /object 2 - The page is loaded, the browser instantiates the plug-in, and the plug-in content is shown. 3 - A script later comes along and dynamically changes the object's type attribute to application/some-unsupported-type Should the browser dynamically and immediately switch from the plug-in to the fallback image? If not, what should it do? And is this specified anywhere? Thanks, ~Brady ... when neither its classid attribute nor its data attribute are present, whenever its type attribute is set, changed, or removed: the user agent must queue a task to run the following steps to (re)determine what the object element represents. The task source for this task is the DOM manipulation task source. http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#the-object-element The algorithm then determines in step 5 that there's no suitable plugin, and falls back. -- Simon Pieters Opera Software
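The quoted spec behavior - re-determining what the object element represents whenever its type attribute changes - can be modeled with a trivial sketch. The support table and function names are hypothetical; the real algorithm has many more steps than the one shown:

```python
# Hypothetical table of types the UA has a plugin for.
SUPPORTED_TYPES = {"application/x-shockwave-flash"}

def represents(type_attr):
    """Simplified stand-in for step 5 of '(re)determine what the object
    element represents': with no classid/data resolution modeled, an
    unsupported type means the element shows its fallback content."""
    if type_attr in SUPPORTED_TYPES:
        return "plugin"
    return "fallback"

# Setting type to an unsupported value queues a task that reruns the
# determination, so the plugin is replaced by the <img> fallback.
before = represents("application/x-shockwave-flash")
after = represents("application/some-unsupported-type")
```

So the answer to the original question is yes: the switch to fallback happens, though via a queued task rather than synchronously at the moment the attribute is set.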
[whatwg] CSP sandbox directive integration with HTML
I wrote some somewhat goofy text in the CSP spec trying to integrate the sandbox directive with HTML's iframe sandbox machinery. Hixie and I chatted in #whatwg about how best to do the integration. I think Hixie is going to refactor the machinery in the spec to be a bit more generic and to call out to the CSP spec to get the sandbox flags from the CSP policy. There are more details in the IRC log below. Thanks, Adam [06:43am] abarth: Hixie: do you have a moment to tell me how nutty this text about sandbox flags is? http://dvcs.w3.org/hg/content-security-policy/raw-file/tip/csp-specification.dev.html#sandbox [06:43am] abarth: When enforcing the sandbox directive, the user agent must set the sandbox flags for the protected document as if the document were contained in a nested browsing context within a document with sandbox flags given by the directive-value. [06:45am] Hixie: hrm [06:45am] abarth: i don't think its quite right [06:45am] abarth: i couldn't find a good hook in HTML for this [06:45am] Hixie: what you probably want to do is set some hook that i can then do the right magic with [06:46am] Hixie: rather than try to poke the html spec flags [06:46am] abarth: ok [06:46am] Hixie: because the flags you have to set are pretty complex and subtle [06:46am] Hixie: and involve the navigation algorithm, etc [06:46am] abarth: how about the CSP sandbox flags as a property of a Document [06:46am] abarth: which will be a string like you'd get in the iframe attribute? [06:46am] abarth: so HTML handles the parsing [06:46am] Hixie: has to be on a browsing context, not a document [06:46am] Hixie: doesn't make sense to sandbox a document [06:46am] abarth: why not? [06:47am] abarth: sorry, let me ask a different question [06:47am] abarth: is a browsing context preserved across navigations? 
[06:47am] Hixie: yes [06:48am] Hixie: but the flags can change during the lifetime of the browsing context [06:48am] abarth: ah [06:48am] abarth: ok [06:48am] Hixie: what matters to all the security stuff is the state when the browsing context was last navigated [06:49am] Hixie: e.g. if... its browsing context had its sandboxed forms browsing context flag set when the Document was created ... [06:49am] abarth: i see [06:49am] Margle joined the chat room. [06:49am] Hixie: but the net result is that you have to set the flags before the document is created [06:49am] abarth: do we have the response headers when the document is created? [06:49am] Hixie: er, before the Document is created [06:49am] Hixie: sure [06:49am] Hixie: assuming it came over HTTP [06:50am] abarth: ok, so when the document is created, HTML needs to ask about the CSP policy for the document [06:50am] abarth: or for the response [06:50am] Hixie: we get the headers by navigate step 19 or so (type sniffing step), we create the document as a side-effect of step 20 (the switch statement that relies on the sniffed type) [06:51am] abarth: Upon receiving an HTTP response containing ... [06:51am] abarth: that's when the CSP policy starts getting enforced [06:51am] abarth: Upon receiving an HTTP response containing at least one Content-Security-Policy header field, the user agent must enforce the combination of all the policies contained in these header fields. [06:52am] Hixie: so... what happens if the page navigates itself to a page without the CSP? [06:52am] Hixie: or does a history.back() to a accomplice page that isn't sandboxed? [06:52am] abarth: that's fine [06:53am] abarth: consider the unique-origin sandbox bits [06:53am] abarth: or the disable-script [06:53am] Hixie: k [06:53am] abarth: those make sense on a per-document basis [06:53am] Hixie: so when do we reset the flags? 
[06:53am] abarth: each navigation [06:54am] abarth: what actually happens in the implementation is that we copy the sandbox flags from the Frame to the Document when the document is created [06:54am] abarth: because we're supposed to freeze the sandbox flags [06:54am] abarth: we enquire about the CSP policy at that time [06:54am] abarth: that happens each time a new document is loaded into a Frame [06:54am] Hixie: hmm... the document is created before the session history change happens [06:55am] Hixie: so we'd have to reset the flags before the old document is removed... [06:55am] Hixie: might make sense to just set the flags temporarily while the document is being created or something [06:55am] Hixie: how is this supposed to interact with the sandbox attribute? union? [06:55am] abarth: can we not just set them on the document when we copy the state to the document? [06:56am] abarth: Hixie: its the same combination operator that happens when you have nested iframes [06:56am] abarth: that each contribute a sandbox attribute [06:57am] Hixie: hmmm [06:57am] Hixie: so the way it works for nested iframes is that setting the flag on an iframe just forces it on for all descendants iframes [06:58am] abarth: yeah, so the union [06:58am] abarth: (assuming the items are things like sandboxed
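The combination operator the two are converging on - a flag forced on by any contributor (nested iframe attribute or CSP directive) stays on - is just a set union over sandboxing flags. A sketch with hypothetical flag and token names (the real parsing is defined by HTML's sandboxing machinery, which this only approximates):

```python
# Hypothetical flag set and allow-* token mapping for illustration.
ALL_FLAGS = {"sandboxed-forms", "sandboxed-scripts",
             "sandboxed-origin", "sandboxed-navigation"}
ALLOW_TOKENS = {
    "allow-forms": "sandboxed-forms",
    "allow-scripts": "sandboxed-scripts",
    "allow-same-origin": "sandboxed-origin",
    "allow-top-navigation": "sandboxed-navigation",
}

def parse_sandbox(attr_value):
    """All flags start set; each recognized allow-* token clears one."""
    flags = set(ALL_FLAGS)
    for token in attr_value.split():
        flags.discard(ALLOW_TOKENS.get(token, ""))
    return flags

def combine(*flag_sets):
    """Union: a restriction imposed anywhere in the chain stays imposed."""
    out = set()
    for fs in flag_sets:
        out |= fs
    return out
```

For example, an outer frame allowing only scripts combined with an inner one allowing only forms leaves everything sandboxed, since each contributor forces on the flags the other relaxed.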
Re: [whatwg] Fixing undo on the Web - UndoManager and Transaction
Hi all, I've added more examples to the document: http://rniwa.com/editing/undomanager.html and also requested feedback on public-webapps. As of this revision, I consider the specification ready for implementation feedback. I will start prototyping it for WebKit and start writing tests. I also welcome your test cases if you have any (do I need to set up a repo for this?). Best regards, Ryosuke Niwa Software Engineer Google Inc.