Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
L. David Baron on Wed Nov 30 18:29:31 PST 2011:
 On Wednesday 2011-11-30 15:28 -0800, Faruk Ates wrote:
 My understanding is that all browsers* default to Western Latin
 (ISO-8859-1) encoding by default (for Western-world
 downloads/OSes) due to legacy content on the web. But how relevant
 is that still today? Has any browser done any recent research into
 the need for this?
 
 The default varies by localization (and within that potentially by
 platform), and unfortunately that variation does matter.  You can
 see Firefox's defaults here:
 http://mxr.mozilla.org/l10n-mozilla-beta/search?string=intl.charset.default
 (The localization and platform are part of the filename.)

Last I checked, some of those locales defaulted to UTF-8. (And HTML5 
defines it the same.) So how is that possible? Don't users of those 
locales travel as much as you do? Or do we consider the English 
locale's users as more important? Something is broken in the logic here!

 I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago
 (by changing the intl.charset.default preference), and I do see a
 decent amount of broken content as a result (maybe I encounter a new
 broken page once a week? -- though substantially more often if I'm
 looking at non-English pages because of travel).

What kind of trouble are you actually describing here? You are 
describing a problem with using UTF-8 for *your locale*. What is your 
locale? It is probably English. Or do you consider your locale to be 
'the Western world locale'? It sounds like *that* is what Anne has in 
mind when he brings in Dutch: 
http://blog.whatwg.org/weekly-encoding-woes (Quite often it sounds as 
if some see Latin-1 - or Windows-1252, as we now should say - as a 
'super default' rather than a locale default. If that is the case - that 
it is a super default - then we should also spec it like that! Until 
further notice, I'll treat Latin-1 as it is specced: as a default for 
certain locales.)

Since it is a locale problem, we need to understand which locale you 
have - and/or which locale you - and the other debaters - think you have. 
Faruk probably uses a Spanish locale, right? If so, the two of you are 
not speaking from the same context. 

However, you also say that your problem is not so much related to pages 
written for *your* locale as to pages written for users of *other* 
locales. So how many times per year do Dutch, Spanish or Norwegian 
pages - and other non-English pages - create trouble for you, as an 
English locale user? I am making an assumption: almost never. 
You don't read those languages, do you? 

This is also a matter of expectations: if you visit a Russian page in a 
legacy Cyrillic encoding, and get mojibake because your browser 
defaults to Latin-1, then what does it matter to you whether your 
browser defaults to Latin-1 or UTF-8? Answer: nothing. 

 I'm wondering if it might not be good to start encouraging
 defaulting to UTF-8, and only fallback to Western Latin if it is
 detected that the content is very old / served by old
 infrastructure or servers, etc. And of course if the content is
 served with an explicit encoding of Western Latin.
 
 The more complex the rules, the harder they are for authors to
 understand / debug.  I wouldn't want to create rules like those.

Agree that that particular idea is probably not the best.
 
 I would, however, like to see movement towards defaulting to UTF-8:
 the current situation makes the Web less world-wide because pages
 that work for one user don't work for another.
 
 I'm just not quite sure how to get from here to there, though, since
 such changes are likely to make users experience broken content.

I think we should 'attack' the dominating locale first: the English 
locale, in its different incarnations (Australian, American, UK). Thus, 
we should turn things on their head: English users should start to expect 
UTF-8 to be used. Because, as English users, you are more used to 
'mojibake' than the rest of us are: whenever you see it, you 'know' 
that it is because you are reading a foreign language. It is we, 
the users of non-English locales, who need the default-to-legacy-encoding 
behavior the most. Or, please, explain to us when and where it 
is important that English language users, living in their own native 
lands so to speak, need their browser to default to Latin-1 so that 
they can correctly read English language pages.

If the English locales start defaulting to UTF-8, then little by 
little, the same expectation etc will start spreading to the other 
locales as well, not least because the 'geeks' of each locale will tend 
to see the English locale as a super default - and they might also use 
the US English locale of their OS and/or browser. We should not 
consider the needs of geeks - they will follow (read: lead) the way, so 
the fact that *they* may see mojibake, should not be a concern.

See? We would have a plan. Or what do you think? Of course, we - or 
rather: the 

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Sergiusz Wolicki
 (And HTML5 defines it the same.)

No. As far as I understand, HTML5 defines US-ASCII to be the default and
requires that any other encoding be explicitly declared. I do like this
approach.

We should also lobby for authoring tools (as recommended by HTML5) to
default their output to UTF-8 and make sure the encoding is declared.  As
so many pages, supposedly (I have not researched this), use the incorrect
encoding, it makes no sense to try to clean up this mess by messing with
existing defaults. It may fix some pages and break others. Browsers have
the ability to override an incorrect encoding, and this is a reasonable
workaround.


-- Sergiusz


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
 (And HTML5 defines it the same.)
 
 No. As far as I understand, HTML5 defines US-ASCII to be the default and
 requires that any other encoding is explicitly declared. I do like this
 approach.

We are here discussing the default *user agent behaviour* - we are not 
specifically discussing how web pages should be authored.

For user agents, please be aware that HTML5 maintains a table of 
'suggested default encodings': 
http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

When you say 'requires': of course, HTML5 recommends that you declare 
the encoding (via HTTP or a higher protocol, via the BOM 'sideshow', or 
via <meta charset=UTF-8>). I just now also discovered that Validator.nu 
issues an error message if it does not find any of those *and* the 
document contains non-ASCII. (I don't know, however, whether this error 
message is just something Henri added at his own discretion - it would 
be nice to have it literally in the spec too.)
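A rough sketch of that error condition in Python (the function name and the 1024-byte prescan window are my own simplifications, not Validator.nu's actual logic):

```python
import re

def missing_encoding_declaration(html_bytes: bytes) -> bool:
    """Sketch of the check described above: the document contains
    non-ASCII bytes but declares no encoding (no BOM and no meta
    charset in the first 1024 bytes). HTTP-level declarations and
    the full HTML5 prescan algorithm are out of scope here."""
    has_non_ascii = any(b > 0x7F for b in html_bytes)
    has_bom = html_bytes.startswith(b"\xef\xbb\xbf")
    has_meta = re.search(rb"<meta[^>]+charset", html_bytes[:1024], re.I) is not None
    return has_non_ascii and not (has_bom or has_meta)

# An undeclared page with non-ASCII trips the check; a declared one doesn't.
assert missing_encoding_declaration("<p>café</p>".encode("utf-8"))
assert not missing_encoding_declaration("<meta charset=utf-8><p>café</p>".encode("utf-8"))
```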

(The problem is of course that many English pages will eventually need 
the whole Unicode repertoire even if they contain only US-ASCII at the 
start.)

HTML5 says that validators *may* issue a warning if UTF-8 is *not* the 
encoding. But so far, Validator.nu has not picked that up.
 
 We should also lobby for authoring tools (as recommended by HTML5) to
 default their output to UTF-8 and make sure the encoding is declared.

HTML5 already says: Authoring tools should default to using UTF-8 for 
newly-created documents. [RFC3629] 
http://dev.w3.org/html5/spec/semantics.html#charset

 As
 so many pages, supposedly (I have not researched this), use the incorrect
 encoding, it makes no sense to try to clean this mess by messing with
 existing defaults. It may fix some pages and break others. Browsers have
 the ability to override an incorrect encoding and this is a reasonable
 workaround.

Do you use an English locale computer? If you do, without being a native 
English speaker, then you are some kind of geek ... Why can't you work 
around the troubles, as you are used to doing anyway?

Starting a switch to UTF-8 as the default UA encoding for English 
locale users should *only* affect how English locale users experience 
languages which *both* need non-ASCII *and* historically have been 
using Windows-1252 as the default encoding *and* which additionally do 
not include any encoding declaration.
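The three-way condition above can be written out as a predicate (a sketch; the function and parameter names are my own):

```python
def rendering_changes_under_utf8_default(needs_non_ascii, fell_back_to_win1252, declares_encoding):
    """A page is affected by switching the English-locale default to
    UTF-8 only if it uses non-ASCII characters, previously relied on
    the Windows-1252 locale fallback, and declares no encoding itself."""
    return needs_non_ascii and fell_back_to_win1252 and not declares_encoding

# A pure-ASCII English page is unaffected:
assert not rendering_changes_under_utf8_default(False, True, False)
# A page that declares its encoding is unaffected regardless of content:
assert not rendering_changes_under_utf8_default(True, True, True)
# Only the undeclared, non-ASCII, Win-1252-reliant page changes:
assert rendering_changes_under_utf8_default(True, True, False)
```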
-- 
Leif Halvard Silli


[whatwg] object, type, and fallback

2011-12-05 Thread Brady Eidson
I can't find a definitive answer for the following scenario:

1 - A page has a plug-in with fallback specified as follows:

<object type="application/x-shockwave-flash">
  <param name="movie" value="Example.swf"/>
  <img src="Fallback.png">
</object>

2 - The page is loaded, the browser instantiates the plug-in, and the plug-in 
content is shown.

3 - A script later comes along and dynamically changes the object's type 
attribute to application/some-unsupported-type

Should the browser dynamically and immediately switch from the plug-in to the 
fallback image?
If not, what should it do?
And is this specified anywhere?

Thanks,
~Brady



Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Boris Zbarsky

On 12/5/11 12:42 PM, Leif Halvard Silli wrote:

Last I checked, some of those locales defaulted to UTF-8. (And HTML5
defines it the same.) So how is that possible?


Because authors authoring pages that users of those locales tend to use 
use UTF-8 more than anything else?



Don't users of those locales travel as much as you do?


People on average travel less than David does, yes.  In all locales.

But that's not the point.  I think you completely misunderstood his 
comments about travel and locales.  Keep reading.



What kind of trouble are you actually describing here? You are
describing a problem with using UTF-8 for *your locale*.


No.  He's describing a problem using UTF-8 to view pages that are not 
written in English.


Now what language are the non-English pages you look at written in? 
Well, it depends.  In western Europe they tend to be in languages that 
can be encoded in ISO-8859-1, so authors sometimes use that encoding 
(without even realizing it).  If you set your browser to default to 
UTF-8, those pages will be broken.


In Japan, a number of pages are authored in Shift_JIS.  Those will 
similarly be broken in a browser defaulting to UTF-8.



What is your locale?


Why does it matter?  David's default locale is almost certainly en-US, 
which defaults to ISO-8859-1 (or whatever Windows-??? encoding that 
actually means on the web) in his browser.  But again, he's changed the 
default encoding from the locale default, so the locale is irrelevant.



(Quite often it sounds as
if some see Latin-1 - or Windows-1252 as we now should say - as a
'super default' rather than a locale default. If that is the case, that
it is a super default, then we should also spec it like that! Until
further, I'll treat Latin-1 as it is specced: As a default for certain
locales.)


That's exactly what it is.


Since it is a locale problem, we need to understand which locale you
have - and/or which locale you - and other debaters - think they have.


Again, doesn't matter if you change your settings from the default.


However, you also say that your problem is not so much related to pages
written for *your* locale as it is related for pages written for users
of *other* locales. So how many times per year do Dutch, Spanish or
Norwegian  - and other non-English pages - are creating troubles for
you, as a English locale user? I am making an assumption: Almost never.
You don't read those languages, do you?


Did you miss the travel part?  Want to look up web pages for museums, 
airports, etc in a non-English speaking country?  There's a good chance 
they're not in English!



This is also an expectation thing: If you visit a Russian page in a
legacy Cyrillic encoding, and gets mojibake because your browser
defaults to Latin-1, then what does it matter to you whether your
browser defaults to Latin-1 or UTF-8? Answer: Nothing.


Yes.  So?


I think we should 'attack' the dominating locale first: The English
locale, in its different incarnations (Australian, American, UK). Thus,
we should turn things on the head: English users should start to expect
UTF-8 to be used. Because, as English users, you are more used to
'mojibake' than the rest of us are: Whenever you see it, you 'know'
that it is because it is a foreign language you are reading.


Modulo smart quotes (and recently Unicode ellipsis characters).  These 
are actually pretty common in English text on the web nowadays, and have 
a tendency to be in ISO-8859-1.
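The smart-quote breakage is easy to reproduce in Python (an illustrative sketch; the sample string is mine, and both directions of the mismatch are shown):

```python
text = "“smart quotes”"  # typical word-processor output

# A Windows-1252 page viewed with a UTF-8 default: the 0x93/0x94
# quote bytes are invalid UTF-8 and become U+FFFD replacement chars.
cp1252_bytes = text.encode("windows-1252")
assert cp1252_bytes.decode("utf-8", errors="replace") == "\ufffdsmart quotes\ufffd"

# The reverse mismatch - a UTF-8 page viewed as Windows-1252 -
# gives the classic three-character garbage per quote mark:
assert "“".encode("utf-8").decode("windows-1252") == "â€œ"
```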



Or, please, explain to us when and where it
is important that English language users living in their own, native
lands so to speak, need that their browser default to Latin-1 so that
they can correctly read English language pages?


See above.


See? We would have a plan. Or what do you think?


Try it in your browser.  When I set UTF-8 as my default, there were 
broken quotation marks all over the web for me.  And I'm talking pages 
in English.


-Boris


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
Boris Zbarsky Mon Dec 5 13:49:45 PST 2011:
 On 12/5/11 12:42 PM, Leif Halvard Silli wrote:
 Last I checked, some of those locales defaulted to UTF-8. (And HTML5
 defines it the same.) So how is that possible?
 
 Because authors authoring pages that users of those locales  
 tend to use use UTF-8 more than anything else?

It is more likely that there is another reason, IMHO: they may have 
tried it, and found that it worked OK. But they of course have the same 
need for reading non-English museum and railway pages as Mozilla 
employees do.

 Don't users of those locales travel as much as you do?

 I think you completely misunderstood his 
 comments about travel and locales.  Keep reading.

I'm pretty sure I haven't misunderstood very much.

 What kind of trouble are you actually describing here? You are
 describing a problem with using UTF-8 for *your locale*.
 
 No.  He's describing a problem using UTF-8 to view pages that are not 
 written in English.

And why is that a problem in those cases when it is a problem? Does he 
read those languages, anyway? Don't we expect some problems when we 
tread outside our borders?
 
 Now what language are the non-English pages you look at written in? 
 Well, it depends.  In western Europe they tend to be in languages that 
 can be encoded in ISO-8859-1, so authors sometimes use that encoding 
 (without even realizing it).  If you set your browser to default to 
 UTF-8, those pages will be broken.
 
 In Japan, a number of pages are authored in Shift_JIS.  Those will 
 similarly be broken in a browser defaulting to UTF-8.

The solution I proposed was that English locale browsers should default 
to UTF-8. Of course, such users could then get problems when in Japan - 
on some Japanese pages - which is a small nuisance, especially if they 
read Japanese.

 What is your locale?
 
 Why does it matter?  David's default locale is almost certainly en-US, 
 which defaults to ISO-8859-1 (or whatever Windows-??? encoding that 
 actually means on the web) in his browser.  But again, he's changed the 
 default encoding from the locale default, so the locale is irrelevant.

The locale is meant predominantly to be used within a physical locale. 
If he is at another physical locale, or virtually at another locale, he 
should not expect things to work out of the box unless a common 
encoding is used. Even today, if he visits Japan, he has to either 
change his browser settings *or* rely on the pages declaring their 
encodings. So nothing would change for him when visiting Japan — with 
his browser or with his computer.

Yes, there would be a change w.r.t. English quotation marks (see 
below) and w.r.t. visiting Western European language pages: for those, 
a number of pages which don't fail with Win-1252 as the default 
would start to fail. But relatively speaking, it is less important that 
non-English pages fail for the English locale.

 (Quite often it sounds as
 if some see Latin-1 - or Windows-1252 as we now should say - as a
 'super default' rather than a locale default. If that is the case, that
 it is a super default, then we should also spec it like that! Until
 further, I'll treat Latin-1 as it is specced: As a default for certain
 locales.)
 
 That's exactly what it is.

A default for certain locales? Right.

 Since it is a locale problem, we need to understand which locale you
 have - and/or which locale you - and other debaters - think they have.
 
 Again, doesn't matter if you change your settings from the default.

I don't think I have misunderstood anything.
 
 However, you also say that your problem is not so much related to pages
 written for *your* locale as it is related for pages written for users
 of *other* locales. So how many times per year do Dutch, Spanish or
 Norwegian  - and other non-English pages - are creating troubles for
 you, as a English locale user? I am making an assumption: Almost never.
 You don't read those languages, do you?
 
 Did you miss the travel part?  Want to look up web pages for museums, 
 airports, etc in a non-English speaking country?  There's a good chance 
 they're not in English!

There is a very good chance, also, that only very few of the Web pages 
for such professional institutions would fail to declare their encoding.

 This is also an expectation thing: If you visit a Russian page in a
 legacy Cyrillic encoding, and gets mojibake because your browser
 defaults to Latin-1, then what does it matter to you whether your
 browser defaults to Latin-1 or UTF-8? Answer: Nothing.
 
 Yes.  So?

So we can look away from Greek, Cyrillic, Japanese, Chinese etc. in 
this debate. In the end, the only benefit for English locale users of 
keeping Win-1252 as the default is that they can have a tiny number of 
fewer problems when visiting Western European language web pages with 
their computer. (Yes, I saw that you mention smart quotes etc. below - 
so there is that reason too.) 

 I think we should 'attack' the dominating 

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Kornel Lesiński

On Fri, 02 Dec 2011 15:50:31 -, Henri Sivonen hsivo...@iki.fi wrote:


That compatibility mode already exists: It's the default mode--just
like the quirks mode is the default for pages that don't have a
doctype. You opt out of the quirks mode by saying <!DOCTYPE html>. You
opt out of the encoding compatibility mode by saying <meta
charset=utf-8>.


Could <!DOCTYPE html> be an opt-in to default UTF-8 encoding?

It would be nice to minimize the number of declarations a page needs to  
include.


--
regards, Kornel Lesiński


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Darin Adler
On Dec 5, 2011, at 4:10 PM, Kornel Lesiński wrote:

 Could <!DOCTYPE html> be an opt-in to default UTF-8 encoding?
 
 It would be nice to minimize the number of declarations a page needs to include.

I like that idea. Maybe it’s not too late.

-- Darin

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Boris Zbarsky

On 12/5/11 6:14 PM, Leif Halvard Silli wrote:

It is more likely that there is another reason, IMHO: They may have
tried it, and found that it worked OK


Where by it you mean open a text editor, type some text, and save. 
So they get whatever encoding their OS and editor defaults to.


And yes, then they find that it works ok, so they don't worry about 
encodings.



No.  He's describing a problem using UTF-8 to view pages that are not
written in English.


And why is that a problem in those cases when it is a problem?


Because the characters are wrong?


Does he read those languages, anyway?


Do you read English?  Seriously, what are you asking there, exactly?

(For the record, reading a particular page in a language is a much 
simpler task than reading the language; I can't read German, but I can 
certainly read a German subway map.)



The solution I proposed was that English locale browsers should default
to UTF-8.


I know the solution you proposed.  That solution tries to avoid the 
issues David was describing by only breaking things for people in 
English browser locales, I understand that.



Why does it matter?  David's default locale is almost certainly en-US,
which defaults to ISO-8859-1 (or whatever Windows-??? encoding that
actually means on the web) in his browser.  But again, he's changed the
default encoding from the locale default, so the locale is irrelevant.


The locale is meant to predominantly be used within a physical locale.


Yes, so?


If he is at another physical locale or a virtually other locale, he
should not be expecting that it works out of the box unless a common
encoding is used.


He was responding to a suggestion that the default encoding be changed 
to UTF-8 for all locales.  Are you _really_ sure you understood the 
point of his mail?



Even today, if he visits Japan, he has to either
change his browser settings *or* to rely on the pages declaring their
encodings. So nothing would change, for him, when visiting Japan — with
his browser or with his computer.


He wasn't saying it's a problem for him per se.  He's a somewhat 
sophisticated browser user who knows how to change the encoding for a 
particular page.


What he was saying is that there are lots of pages out there that aren't 
encoded in UTF-8 and rely on locale fallbacks to particular encodings, 
and that he's run into them a bunch while traveling in particular, so 
they were not pages in English.  So far, you and he seem to agree.



Yes, there would be a change w.r.t. English quotation marks (see
below) and w.r.t. visiting Western European language pages: for those,
a number of pages which don't fail with Win-1252 as the default
would start to fail. But relatively speaking, it is less important that
non-English pages fail for the English locale.


No one is worried about that, particularly.


There is a very good chance, also, that only very few of the Web pages
for such professional institutions would fail to declare their encoding.


You'd be surprised.


Modulo smart quotes (and recently unicode ellipsis characters).  These
are actually pretty common in English text on the web nowadays, and have
a tendency to be in ISO-8859-1.


If we change the default, they will start to tend to be in UTF-8.


Not unless we change the authoring tools.  Half the time these things 
are just directly exported from a word processor.



OK: quotation marks. However, in 'old web pages' you also find
much more use of HTML entities (such as &ldquo;) than you find today.
We should take advantage of that, no?


I have no idea what you're trying to say,


When you mention quotation marks, then you mention a real locale
related issue. And may be the Euro sign too?


Not an issue for me personally, but it could be for some, yes.


Nevertheless, the problem is smallest for languages that primarily 
limit their alphabet to those letters that are present in the American 
Standard Code for Information Interchange (US-ASCII).


Sure.  It may still be too big.


It would be logical, thus, to start the switch to
UTF-8 for those locales


If we start at all.


Perhaps we need to have a project to measure these problems, instead of
all these anecdotes?


Sure.  More data is always better than anecdotes.

-Boris


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
On 12/5/11 6:14 PM, Leif Halvard Silli wrote:
 It is more likely that there is another reason, IMHO: They may have
 tried it, and found that it worked OK
 
 Where by it you mean open a text editor, type some text, and save. 
 So they get whatever encoding their OS and editor defaults to.

If that is all they tested, then I'd say they did not test enough.

 And yes, then they find that it works ok, so they don't worry about 
 encodings.

Ditto.
 
 No.  He's describing a problem using UTF-8 to view pages that are not
 written in English.

 And why is that a problem in those cases when it is a problem?
 
 Because the characters are wrong?

But the characters will be wrong many more times than just those 
times when he tries to read a Web page in a Western European 
language that is not declared as Win-1252. Do English locale users 
have particular expectations with regard to exactly those Web pages? 
What about Polish Web pages etc.? English locale users are a very 
multiethnic lot.

 Do he read those languages, anyway?
 
 Do you read English?  Seriously, what are you asking there, exactly?

Because if it is an issue, then it is about expectations for exactly 
those pages. (Plus the quote problem, of course.)

 (For the record, reading a particular page in a language is a much 
 simpler task than reading the language; I can't read German, but I can 
 certainly read a German subway map.)

Or a Polish subway map - which doesn't default to said encoding.

 The solution I proposed was that English locale browsers should default
 to UTF-8.
 
 I know the solution you proposed.  That solution tries to avoid the 
 issues David was describing by only breaking things for people in 
 English browser locales, I understand that.

That characterization is only true with regard to the quote problem. 
That German pages break would not be any more important than the 
fact that Polish pages would. For that matter: it happens that UTF-8 
pages break as well.

I only suggest it as a first step, so to speak. Or rather - since some 
locales apparently already default to UTF-8 - as a next step. 
Thereafter, more locales would be expected to follow suit - as the 
development of each locale permits.

 Why does it matter?  David's default locale is almost certainly en-US,
 which defaults to ISO-8859-1 (or whatever Windows-??? encoding that
 actually means on the web) in his browser.  But again, he's changed the
 default encoding from the locale default, so the locale is irrelevant.

 The locale is meant to predominantly be used within a physical locale.
 
 Yes, so?

So then we have a set of expectations for the language of that locale. 
If we look at how the locale settings handle other languages, then we 
are outside the issue that the locale-specific encodings are supposed 
to handle.

 If he is at another physical locale or a virtually other locale, he
 should not be expecting that it works out of the box unless a common
 encoding is used.
 
 He was responding to a suggestion that the default encoding be changed 
 to UTF-8 for all locales.  Are you _really_ sure you understood the 
 point of his mail?

I said I agreed with him that Faruk's solution was not good. However, I 
would not be against treating <!DOCTYPE html> as a 'default to UTF-8' 
declaration, as suggested by some - if it were possible to agree about 
that. Then we could keep things as they are, except for the HTML5 
DOCTYPE. I guess the HTML5 doctype would become 'the default before the 
default': if everything else fails, then UTF-8 if the DOCTYPE is 
<!DOCTYPE html>; or else, the locale default.
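That 'default before the default' cascade is simple to state. A Python sketch (function and parameter names are mine, not from any spec):

```python
def fallback_encoding(declared, has_html5_doctype, locale_default):
    """Sketch of the proposal above: an explicit declaration (HTTP,
    BOM, or meta) always wins; failing that, <!DOCTYPE html> would
    opt the page in to UTF-8; only then does the locale default apply."""
    if declared:
        return declared
    if has_html5_doctype:
        return "utf-8"
    return locale_default

assert fallback_encoding(None, True, "windows-1252") == "utf-8"
assert fallback_encoding(None, False, "windows-1252") == "windows-1252"
assert fallback_encoding("shift_jis", True, "windows-1252") == "shift_jis"
```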

It sounded like Darin Adler thinks it is possible. How about you?
 
 Even today, if he visits Japan, he has to either
 change his browser settings *or* to rely on the pages declaring their
 encodings. So nothing would change, for him, when visiting Japan — with
 his browser or with his computer.
 
 He wasn't saying it's a problem for him per se.  He's a somewhat 
 sophisticated browser user who knows how to change the encoding for a 
 particular page.

If we are talking about an English locale user visiting Japan, then I 
doubt a change in the default encoding would matter - Win-1252 as the 
default would be wrong anyway.

 What he was saying is that there are lots of pages out there that aren't 
 encoded in UTF-8 and rely on locale fallbacks to particular encodings, 
 and that he's run into them a bunch while traveling in particular, so 
 they were not pages in English.  So far, you and he seem to agree.

So far we agree, yes.
 
 Yes, there would be a change, w.r.t. English quotation marks (see
 below) and w.r.t. visiting Western European language pages: a number
 of pages which don't fail with Win-1252 as the default would start to
 fail. But relatively speaking, it is less important that non-English
 pages fail for the English locale.
 
 No one is worried about that, particularly.

You spoke about visiting German pages above - sounded like you worried, 
but 

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Boris Zbarsky

On 12/5/11 9:55 PM, Leif Halvard Silli wrote:

If that is all they tested, then I'd said they did not test enough.


That's normal for the web.


(For the record, reading a particular page in a language is a much
simpler task than reading the language; I can't read German, but I can
certainly read a German subway map.)


Or a Polish subway map - which doesn't default to said encoding.


Indeed.  I don't think anyone thinks the existing situation is all fine 
or anything.



I said I agreed with him that Faruk's solution was not good. However, I
would not be against treating DOCTYPE html as a 'default to UTF-8'
declaration


This might work, if there hasn't been too much cargo-culting yet.  Data 
urgently needed!



Not unless we change the authoring tools.  Half the time these things
are just directly exported from a word processor.


Please educate me. I'm perhaps 'handicapped' in that regard: I haven't
used MS Word on a regular basis since MS Word 5.1 for Mac. Also, if
export means copy and paste


It can mean that, or save as HTML followed by copy and paste.


then on the Mac, everything gets
converted via the clipboard


On Mac, the default OS encoding is UTF-8 last I checked.  That's 
decidedly not the case on Windows.



OK: Quotation marks. However, in 'old web pages', then you also find
much more use of HTML entities (such as &#8220;) than you find today.
We should take advantage of that, no?


I have no idea what you're trying to say,


Sorry. What I meant was that character entities are encoding
independent.


Yes.


And that lots of people - and authoring tools - have
inserted non-ASCII letters and characters as character entities,


Sure.  And lots have inserted them directly.


At any rate: A page which uses
character entities for non-ASCII would render the same regardless of
encoding, hence a switch to UTF-8 would not matter for those.


Sure.  We're not worried about such pages here.

-Boris
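The encoding-independence point above can be sketched as follows (my own illustration, not from the thread): the markup "&#8220;" is pure ASCII, so it decodes identically under Windows-1252 and UTF-8, while the raw UTF-8 bytes for the same character turn into mojibake when a page is mis-decoded as Windows-1252.

```python
# Numeric character references survive any ASCII-superset default
# encoding; raw non-ASCII bytes do not.
entity_bytes = b"&#8220;"             # LEFT DOUBLE QUOTATION MARK as an entity
raw_bytes = "\u201c".encode("utf-8")  # the same character as raw UTF-8 bytes

same_under_both = (entity_bytes.decode("windows-1252")
                   == entity_bytes.decode("utf-8"))
mojibake = raw_bytes.decode("windows-1252")

print(same_under_both)  # True: the entity renders the same either way
print(mojibake)         # what readers see when the default is wrong
```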



Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
Boris Zbarsky Mon Dec 5 19:18:10 PST 2011:
 On 12/5/11 9:55 PM, Leif Halvard Silli wrote:

 I said I agreed with him that Faruk's solution was not good. However, I
 would not be against treating DOCTYPE html as a 'default to UTF-8'
 declaration
 
 This might work, if there hasn't been too much cargo-culting yet.  Data 
 urgently needed!

Yeah, it would be a pity if it had already become a widespread 
cargo cult to use the HTML5 doctype while - all at once - neither 
using UTF-8 *nor* any encoding declaration, thus effectively relying 
on the default locale encoding ... Who has a data corpus? Henri, as 
the Validator.nu developer?

This change would involve adding one more step to the HTML5 parser's 
encoding sniffing algorithm. [1] The question then is when, upon seeing 
the HTML5 doctype, the default to UTF-8 ought to happen in order to be 
useful. It seems it would have to happen after the processing of the 
explicit metadata (steps 1 to 5) but before the last three steps - 
steps 6, 7 and 8:

Step 6: 'if the user agent has information on the likely encoding'
Step 7: UA 'may attempt to autodetect the character encoding'
Step 8: 'implementation-defined or user-specified default'

The role of the HTML5 DOCTYPE, encoding-wise, would then be to ensure 
that steps 6 to 8 do not happen. 
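A rough sketch (my own illustration, not spec text) of where such a doctype-implies-UTF-8 step could sit: after the explicit metadata (BOM, transport layer, meta charset, simplified here) but before the likely-encoding, autodetection and locale-default steps. The function name and the exact checks are assumptions for illustration only.

```python
import re

def sniff_encoding(prefix: bytes, transport_charset=None,
                   locale_default="windows-1252"):
    # Steps 1-5 (much simplified): explicit information wins.
    if prefix.startswith(b"\xef\xbb\xbf"):
        return "utf-8"                      # byte order mark
    if transport_charset:
        return transport_charset            # e.g. HTTP Content-Type charset
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', prefix, re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    # Proposed extra step: the HTML5 doctype defaults to UTF-8 ...
    if re.search(rb"<!doctype\s+html\s*>", prefix, re.I):
        return "utf-8"
    # ... so steps 6-8 (likely encoding, autodetection, locale default)
    # are only reached by pages without the HTML5 doctype.
    return locale_default

print(sniff_encoding(b"<!DOCTYPE html><title>hi</title>"))  # utf-8
print(sniff_encoding(b"<html><title>hi</title>"))           # windows-1252
```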

[1] http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm
-- 
Leif H Silli


Re: [whatwg] object, type, and fallback

2011-12-05 Thread Simon Pieters

On Mon, 05 Dec 2011 22:19:33 +0100, Brady Eidson beid...@apple.com wrote:


I can't find a definitive answer for the following scenario:

1 - A page has a plug-in with fallback specified as follows:

<object type="application/x-shockwave-flash">
  <param name="movie" value="Example.swf"/>
  <img src="Fallback.png">
</object>

2 - The page is loaded, the browser instantiates the plug-in, and the  
plug-in content is shown.


3 - A script later comes along and dynamically changes the object's  
type attribute to application/some-unsupported-type


Should the browser dynamically and immediately switch from the plug-in  
to the fallback image?

If not, what should it do?
And is this specified anywhere?

Thanks,
~Brady



... when neither its classid attribute nor its data attribute are  
present, whenever its type attribute is set, changed, or removed: the user  
agent must queue a task to run the following steps to (re)determine what  
the object element represents. The task source for this task is the DOM  
manipulation task source.


http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#the-object-element

The algorithm then determines in step 5 that there's no suitable plugin,  
and falls back.
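A toy model of the behaviour described above (my own sketch, not spec text): when the element has neither classid nor data, a change to the type attribute re-runs the "what the object element represents" steps, and an unsupported type falls back to the element's contents. The function name and the supported-type set are assumptions for illustration.

```python
SUPPORTED_PLUGIN_TYPES = {"application/x-shockwave-flash"}  # assumed set

def object_represents(attrs: dict) -> str:
    # Only the no-classid/no-data branch from this thread is modelled.
    if "classid" in attrs or "data" in attrs:
        raise NotImplementedError("other branches of the algorithm")
    if attrs.get("type") in SUPPORTED_PLUGIN_TYPES:
        return "plugin"
    return "fallback"   # e.g. the <img src="Fallback.png"> child

attrs = {"type": "application/x-shockwave-flash"}
print(object_represents(attrs))                 # plugin
attrs["type"] = "application/some-unsupported-type"
print(object_represents(attrs))                 # fallback
```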


--
Simon Pieters
Opera Software


[whatwg] CSP sandbox directive integration with HTML

2011-12-05 Thread Adam Barth
I wrote some somewhat goofy text in the CSP spec trying to integrate
the sandbox directive with HTML's iframe sandbox machinery.  Hixie and
I chatted in #whatwg about how best to do the integration.  I think
Hixie is going to refactor the machinery in the spec to be a bit more
generic and to call out to the CSP spec to get the sandbox flags from
the CSP policy.  There are more details in the IRC log below.

Thanks,
Adam
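The combination rule discussed in the IRC log below - each contributor of sandbox flags (ancestor iframe sandbox attributes, the CSP sandbox directive) forces its restrictions on, i.e. a union of restrictions - can be sketched as follows. This is my own illustration; the flag names are a subset used for the example and the function is not from either spec.

```python
def effective_sandbox_flags(*sources):
    # Each source names the keywords it *allows* (as in the sandbox
    # attribute); a restriction imposed by any source (a keyword it
    # omits) stays in force, so restrictions combine as a union.
    ALL = {"allow-forms", "allow-scripts", "allow-same-origin",
           "allow-popups", "allow-top-navigation"}
    restricted = set()
    for allowed in sources:
        restricted |= ALL - set(allowed)
    return restricted

iframe_attr = {"allow-scripts", "allow-forms"}
csp_directive = {"allow-scripts"}
# Restrictions still in force after combining both sources:
print(sorted(effective_sandbox_flags(iframe_attr, csp_directive)))
```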


[06:43am] abarth: Hixie: do you have a moment to tell me how nutty
this text about sandbox flags is?
http://dvcs.w3.org/hg/content-security-policy/raw-file/tip/csp-specification.dev.html#sandbox
[06:43am] abarth: When enforcing the sandbox directive, the user
agent must set the sandbox flags for the protected document as if the
document were contained in a nested browsing context within a
document with sandbox flags given by the directive-value.
[06:45am] Hixie: hrm
[06:45am] abarth: i don't think its quite right
[06:45am] abarth: i couldn't find a good hook in HTML for this
[06:45am] Hixie: what you probably want to do is set some hook that i
can then do the right magic with
[06:46am] Hixie: rather than try to poke the html spec flags
[06:46am] abarth: ok
[06:46am] Hixie: because the flags you have to set are pretty complex and subtle
[06:46am] Hixie: and involve the navigation algorithm, etc
[06:46am] abarth: how about the CSP sandbox flags as a property of a Document
[06:46am] abarth: which will be a string like you'd get in the iframe attribute?
[06:46am] abarth: so HTML handles the parsing
[06:46am] Hixie: has to be on a browsing context, not a document
[06:46am] Hixie: doesn't make sense to sandbox a document
[06:46am] abarth: why not?
[06:47am] abarth: sorry, let me ask a different question
[06:47am] abarth: is a browsing context preserved across navigations?
[06:47am] Hixie: yes
[06:48am] Hixie: but the flags can change during the lifetime of the
browsing context
[06:48am] abarth: ah
[06:48am] abarth: ok
[06:48am] Hixie: what matters to all the security stuff is the state
when the browsing context was last navigated
[06:49am] Hixie: e.g. if... its browsing context had its sandboxed
forms browsing context flag set when the Document was created ...
[06:49am] abarth: i see
[06:49am] Margle joined the chat room.
[06:49am] Hixie: but the net result is that you have to set the flags
before the document is created
[06:49am] abarth: do we have the response headers when the document is created?
[06:49am] Hixie: er, before the Document is created
[06:49am] Hixie: sure
[06:49am] Hixie: assuming it came over HTTP
[06:50am] abarth: ok, so when the document is created, HTML needs to
ask about the CSP policy for the document
[06:50am] abarth: or for the response
[06:50am] Hixie: we get the headers by navigate step 19 or so (type
sniffing step), we create the document as a side-effect of step 20
(the switch statement that relies on the sniffed type)
[06:51am] abarth: Upon receiving an HTTP response containing ...
[06:51am] abarth: that's when the CSP policy starts getting enforced
[06:51am] abarth: Upon receiving an HTTP response containing at least
one Content-Security-Policy header field, the user agent must enforce
the combination of all the policies contained in these header fields.
[06:52am] Hixie: so... what happens if the page navigates itself to a
page without the CSP?
[06:52am] Hixie: or does a history.back() to a accomplice page that
isn't sandboxed?
[06:52am] abarth: that's fine
[06:53am] abarth: consider the unique-origin sandbox bits
[06:53am] abarth: or the disable-script
[06:53am] Hixie: k
[06:53am] abarth: those make sense on a per-document basis
[06:53am] Hixie: so when do we reset the flags?
[06:53am] abarth: each navigation
[06:54am] abarth: what actually happens in the implementation is that
we copy the sandbox flags from the Frame to the Document when the
document is created
[06:54am] abarth: because we're supposed to freeze the sandbox flags
[06:54am] abarth: we enquire about the CSP policy at that time
[06:54am] abarth: that happens each time a new document is loaded into a Frame
[06:54am] Hixie: hmm... the document is created before the session
history change happens
[06:55am] Hixie: so we'd have to reset the flags before the old
document is removed...
[06:55am] Hixie: might make sense to just set the flags temporarily
while the document is being created or something
[06:55am] Hixie: how is this supposed to interact with the sandbox
attribute? union?
[06:55am] abarth: can we not just set them on the document when we
copy the state to the document?
[06:56am] abarth: Hixie: its the same combination operator that
happens when you have nested iframes
[06:56am] abarth: that each contribute a sandbox attribute
[06:57am] Hixie: hmmm
[06:57am] Hixie: so the way it works for nested iframes is that
setting the flag on an iframe just forces it on for all descendants
iframes
[06:58am] abarth: yeah, so the union
[06:58am] abarth: (assuming the items are things like sandboxed

Re: [whatwg] Fixing undo on the Web - UndoManager and Transaction

2011-12-05 Thread Ryosuke Niwa
Hi all,

I've added more examples to the document:
http://rniwa.com/editing/undomanager.html and also requested feedback on
public-webapps. As of this revision, I consider the specification ready
for implementation feedback. I will start prototyping it for WebKit and
start writing tests.

I also welcome your test cases if you have any (do I need to set up a repo
for this?).

Best regards,
Ryosuke Niwa
Software Engineer
Google Inc.