Re: [whatwg] Default encoding to UTF-8?

2012-04-04 Thread Henri Sivonen
On Tue, Apr 3, 2012 at 10:08 PM, Anne van Kesteren ann...@opera.com wrote:
 I didn't mean a prescan.  I meant proceeding with the real parse and
 switching decoders in midstream. This would have the complication of
 also having to change the encoding the document object reports to
 JavaScript in some cases.

 On IRC (#whatwg) zcorpan pointed out this would break URLs where entities
 are used to encode non-ASCII code points in the query component.

Good point.  So it's not worthwhile to add magic here.  It's better
that authors declare that they are using UTF-8.
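
A quick Python sketch of zcorpan's point (illustrative only, not from the
thread): the same character in a query component percent-encodes
differently under UTF-8 and windows-1252, so switching decoders midstream
could change URLs that entities had already been resolved into.

from urllib.parse import quote

q = "\u00e6"                              # what the entity &#230; expands to
print(quote(q, encoding="utf-8"))         # -> %C3%A6
print(quote(q, encoding="windows-1252"))  # -> %E6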

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Default encoding to UTF-8?

2012-04-03 Thread Henri Sivonen
On Wed, Jan 4, 2012 at 12:34 AM, Leif Halvard Silli
xn--mlform-...@xn--mlform-iua.no wrote:
 I mean the performance impact of reloading the page or,
 alternatively, the loss of incremental rendering.)

 A solution that would border on reasonable would be decoding as
 US-ASCII up to the first non-ASCII byte

 Thus possibly prescan of more than 1024 bytes?

I didn't mean a prescan.  I meant proceeding with the real parse and
switching decoders in midstream. This would have the complication of
also having to change the encoding the document object reports to
JavaScript in some cases.

 and then deciding between
 UTF-8 and the locale-specific legacy encoding by examining the first
 non-ASCII byte and up to 3 bytes after it to see if they form a valid
 UTF-8 byte sequence.

 Except for the specifics, that sounds like more or less the idea I
 tried to state. Maybe it could be made into a bug in Mozilla?

It's not clear that this is actually worth implementing or spending
time on at this stage.

 However, there is one thing that should be added: The parser should
 default to UTF-8 even if it does not detect any UTF-8-ish non-ASCII.

That would break form submissions.

 But trying to gain more statistical confidence
 about UTF-8ness than that would be bad for performance (either due to
 stalling stream processing or due to reloading).

 So here you say that it is better to start presenting early, and
 eventually reload [I think] if, during the presentation, the encoding
 choice shows itself to be wrong, than it would be to investigate too
 much and be absolutely certain before starting to present the page.

I didn't intend to suggest reloading.

 Adding autodetection wouldn't actually force authors to use UTF-8, so
 the problem Faruk stated at the start of the thread (authors not using
 UTF-8 throughout systems that process user input) wouldn't be solved.

 If we take that logic to its end, then it would not make sense for the
 validator to display an error when a page contains a form without being
 UTF-8 encoded, either. Because, after all, the backend/whatever could
 be non-UTF-8 based. The only way to solve that problem on those
 systems, would be to send form content as character entities. (However,
 then too the form based page should still be UTF-8 in the first place,
 in order to be able to take any content.)

Presumably, when an author reacts to an error message, (s)he not only
fixes the page but also the back end.  When a browser makes encoding
guesses, it obviously cannot fix the back end.

 [ Original letter continued: ]
 Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding
 detection. So it might still be a competitive advantage.

 It would be interesting to know what exactly Chrome does. Maybe
 someone who knows the code could enlighten us?

 +1 (But their approach looks similar to the 'border on sane' approach
 you presented. Except that they seek to detect also non-UTF-8.)

I'm slightly disappointed but not surprised that this thread hasn't
gained a message explaining what Chrome does exactly.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Default encoding to UTF-8?

2012-04-03 Thread Anne van Kesteren

On Tue, 03 Apr 2012 13:59:25 +0200, Henri Sivonen hsivo...@iki.fi wrote:

On Wed, Jan 4, 2012 at 12:34 AM, Leif Halvard Silli
xn--mlform-...@xn--mlform-iua.no wrote:

A solution that would border on reasonable would be decoding as
US-ASCII up to the first non-ASCII byte


Thus possibly prescan of more than 1024 bytes?


I didn't mean a prescan.  I meant proceeding with the real parse and
switching decoders in midstream. This would have the complication of
also having to change the encoding the document object reports to
JavaScript in some cases.


On IRC (#whatwg) zcorpan pointed out this would break URLs where entities  
are used to encode non-ASCII code points in the query component.



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Default encoding to UTF-8?

2012-01-03 Thread Henri Sivonen
On Thu, Dec 22, 2011 at 12:36 PM, Leif Halvard Silli l...@russisk.no wrote:
 It's unclear to me if you are talking about HTTP-level charset=UNICODE
 or charset=UNICODE in a meta. Is content labeled with charset=UNICODE
 BOMless?

 Charset=UNICODE in meta, as generated by MS tools (Office or IE, eg.)
 seems to usually be BOM-full. But there are still enough occurrences
 of pages without BOM. I have found UTF-8 pages with the charset=unicode
 label in meta. But the few pages I found contained either a BOM or
 HTTP-level charset=utf-8. I have too little research material when it
 comes to UTF-8 pages with charset=unicode inside.

Making 'unicode' an alias of UTF-16 or UTF-16LE would be useless for
pages that have a BOM, because the BOM is already inspected before
meta and if HTTP-level charset is unrecognized, the BOM wins.

Making 'unicode' an alias of UTF-16 or UTF-16LE would be useful for
UTF-8-encoded pages that say charset=unicode in meta if alias
resolution happens before UTF-16 labels are mapped to UTF-8.
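
A toy Python sketch of that ordering question (the table and names here are
my own illustration, not any spec or browser code):

# Resolve the label's alias first, then apply the rule that UTF-16 family
# labels found in a meta are treated as UTF-8.
ALIASES = {"unicode": "utf-16"}              # assumed subset of an alias table

def encoding_from_meta(label: str) -> str:
    label = ALIASES.get(label.strip().lower(), label.strip().lower())
    if label in ("utf-16", "utf-16le", "utf-16be"):
        return "utf-8"                       # UTF-16 labels in meta map to UTF-8
    return label

print(encoding_from_meta("UNICODE"))         # -> utf-8, but only because aliasing ran first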

Making 'unicode' an alias for UTF-16 or UTF-16LE would be useless for
pages that are (BOMless) UTF-16LE and that have charset=unicode in
meta, because the meta prescan doesn't see UTF-16-encoded metas.
Furthermore, it doesn't make sense to make the meta prescan look for
UTF-16-encoded metas, because it would make sense to honor the value
only if it matched a flavor of UTF-16 appropriate for the pattern of
zero bytes in the file, so it would be more reliable and straight
forward to just analyze the pattern of zero bytes without bothering to
look for UTF-16-encoded metas.
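
A rough Python sketch of that zero-byte analysis (illustrative only; the
threshold and names are my own assumptions, not Gecko code):

def guess_utf16_flavour(prefix: bytes):
    # Count zero bytes at even and odd offsets of the ASCII-heavy prefix.
    if len(prefix) < 4:
        return None
    pairs = len(prefix) // 2
    even_zeros = sum(1 for i in range(0, len(prefix) - 1, 2) if prefix[i] == 0)
    odd_zeros = sum(1 for i in range(1, len(prefix), 2) if prefix[i] == 0)
    if even_zeros > 0.9 * pairs and odd_zeros == 0:
        return "utf-16be"   # zero byte comes first in each code unit
    if odd_zeros > 0.9 * pairs and even_zeros == 0:
        return "utf-16le"   # zero byte comes second in each code unit
    return None

print(guess_utf16_flavour("<html>".encode("utf-16le")))  # -> utf-16le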

 When the detector says UTF-8 - that is step 7 of the sniffing algorithm,
 no?
 http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

Yes.

  2) Start the parse assuming UTF-8 and reload as Windows-1252 if the
 detector says non-UTF-8.
...
 I think you are mistaken there: If parsers perform UTF-8 detection,
 then unlabelled pages will be detected, and no reparsing will happen.
 Not even an increase. You at least need to explain this negative spiral
 theory better before I buy it ... Step 7 will *not* lead to reparsing
 unless the default encoding is WINDOWS-1252. If the default encoding is
 UTF-8, then when step 7 detects UTF-8, it means that parsing
 can continue uninterrupted.

That would be what I labeled as option #2 above.

 What we will instead see is that those using legacy encodings must be
 more clever in labelling their pages, or else they won't be detected.

Many pages that use legacy encodings are legacy pages that aren't
actively maintained. Unmaintained pages aren't going to become more
clever about labeling.

 I am a bit baffled here: It sounds like you say that there will be bad
 consequences if browsers become more reliable ...

Becoming more reliable can be bad if the reliability comes at the cost
of performance, which would be the case if the kind of heuristic
detector that e.g. Firefox has was turned on for all locales. (I don't
mean the performance impact of running a detector state machine. I
mean the performance impact of reloading the page or, alternatively,
the loss of incremental rendering.)

A solution that would border on reasonable would be decoding as
US-ASCII up to the first non-ASCII byte and then deciding between
UTF-8 and the locale-specific legacy encoding by examining the first
non-ASCII byte and up to 3 bytes after it to see if they form a valid
UTF-8 byte sequence. But trying to gain more statistical confidence
about UTF-8ness than that would be bad for performance (either due to
stalling stream processing or due to reloading).
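
A rough Python sketch of that idea (illustrative only, not proposed Gecko
code; it deliberately ignores UTF-8's overlong and surrogate corner cases):

def looks_like_utf8_at(data: bytes, i: int) -> bool:
    b = data[i]
    if 0xC2 <= b <= 0xDF:
        need = 1            # 2-byte sequence
    elif 0xE0 <= b <= 0xEF:
        need = 2            # 3-byte sequence
    elif 0xF0 <= b <= 0xF4:
        need = 3            # 4-byte sequence
    else:
        return False        # lone continuation byte or invalid lead
    tail = data[i + 1:i + 1 + need]
    return len(tail) == need and all(0x80 <= t <= 0xBF for t in tail)

def sniff(data: bytes, legacy: str = "windows-1252") -> str:
    for i, b in enumerate(data):
        if b >= 0x80:       # first non-ASCII byte decides
            return "utf-8" if looks_like_utf8_at(data, i) else legacy
    return legacy           # all-ASCII: both decoders agree anyway

print(sniff("<title>æøå</title>".encode("utf-8")))         # -> utf-8
print(sniff("<title>æøå</title>".encode("windows-1252")))  # -> windows-1252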

 Apart from UTF-16, Chrome seems quite aggressive w.r.t. encoding
 detection. So it might still be a competitive advantage.

It would be interesting to know what exactly Chrome does. Maybe
someone who knows the code could enlighten us?

 * Let's say that I *kept* ISO-8859-1 as default encoding, but instead
 enabled the Universal detector. The frame then works.
 * But if I make the frame page very short, 10 * the letter ø as
 content, then the Universal detector fails - on a test on my own
 computer, it guesses the page to be Cyrillic rather than Norwegian.
 * What's the problem? The Universal detector is too greedy - it tries
 to fix more problems than I have. I only want it to guess on UTF-8.
 And if it doesn't detect UTF-8, then it should fall back to the locale
 default (including fall back to the encoding of the parent frame).

 Wouldn't that be an idea?

 No. The current configuration works for Norwegian users already. For
 users from different silos, the ad might break, but ad breakage is
 less bad than spreading heuristic detection to more locales.

 Here I must disagree: Less bad for whom?

For users performance-wise.

--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Default encoding to UTF-8?

2012-01-03 Thread Henri Sivonen
On Tue, Jan 3, 2012 at 10:33 AM, Henri Sivonen hsivo...@iki.fi wrote:
 A solution that would border on reasonable would be decoding as
 US-ASCII up to the first non-ASCII byte and then deciding between
 UTF-8 and the locale-specific legacy encoding by examining the first
 non-ASCII byte and up to 3 bytes after it to see if they form a valid
 UTF-8 byte sequence. But trying to gain more statistical confidence
 about UTF-8ness than that would be bad for performance (either due to
 stalling stream processing or due to reloading).

And it's worth noting that the above paragraph states a solution to
the problem that is: How to make it possible to use UTF-8 without
declaring it?

Adding autodetection wouldn't actually force authors to use UTF-8, so
the problem Faruk stated at the start of the thread (authors not using
UTF-8 throughout systems that process user input) wouldn't be solved.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Default encoding to UTF-8?

2012-01-03 Thread Leif Halvard Silli
Henri Sivonen, Tue Jan 3 00:33:02 PST 2012:
 On Thu, Dec 22, 2011 at 12:36 PM, Leif Halvard Silli wrote:

 Making 'unicode' an alias of UTF-16 or UTF-16LE would be useful for
 UTF-8-encoded pages that say charset=unicode in meta if alias
 resolution happens before UTF-16 labels are mapped to UTF-8.

Yup.
 
 Making 'unicode' an alias for UTF-16 or UTF-16LE would be useless for
 pages that are (BOMless) UTF-16LE and that have charset=unicode in
 meta, because the meta prescan doesn't see UTF-16-encoded metas.

Hm. Yes. I see that I misread something, and ended up believing that 
the meta would *still* be used if the mapping from 'UTF-16' to 
'UTF-8' turned out to be incorrect. I guess I had not understood, well 
enough, that the meta prescan *really* doesn't see UTF-16-encoded 
metas. Also contributing was the fact that I did not realize that IE 
doesn't actually read the page as UTF-16 but as Windows-1252: 
http://www.hughesrenier.be/actualites.html. (Actually, browsers do 
see the UTF-16 meta, but only if the default encoding is set to be 
UTF-16 - see step 1 of '8.2.2.4 Changing the encoding while parsing' 
http://dev.w3.org/html5/spec/parsing.html#change-the-encoding.)

 Furthermore, it doesn't make sense to make the meta prescan look for
 UTF-16-encoded metas, because it would make sense to honor the value
 only if it matched a flavor of UTF-16 appropriate for the pattern of
 zero bytes in the file, so it would be more reliable and
 straightforward to just analyze the pattern of zero bytes without bothering to
 look for UTF-16-encoded metas.

Makes sense.

   [ snip ]
 What we will instead see is that those using legacy encodings must be
 more clever in labelling their pages, or else they won't be detected.
 
 Many pages that use legacy encodings are legacy pages that aren't
 actively maintained. Unmaintained pages aren't going to become more
 clever about labeling.

But their Non-UTF-8-ness should be picked up in the first 1024 bytes?

  [... sniff - sorry, meant snip ;-) ...]

 I mean the performance impact of reloading the page or, 
 alternatively, the loss of incremental rendering.)

 A solution that would border on reasonable would be decoding as
 US-ASCII up to the first non-ASCII byte

Thus possibly prescan of more than 1024 bytes? Is it faster to scan 
ASCII? (In Chrome, there does not seem to be an end to the prescan, as 
long as the text source code is ASCII only.)

 and then deciding between
 UTF-8 and the locale-specific legacy encoding by examining the first
 non-ASCII byte and up to 3 bytes after it to see if they form a valid
 UTF-8 byte sequence.

Except for the specifics, that sounds like more or less the idea I 
tried to state. Maybe it could be made into a bug in Mozilla? (I could 
do it, but ...)

However, there is one thing that should be added: The parser should 
default to UTF-8 even if it does not detect any UTF-8-ish non-ASCII. Is 
that part of your idea? Because, if it does not behave like that, then 
it would work the way Google Chrome works now. Which, for the following 
UTF-8-encoded (but charset-unlabelled) page, means that it defaults to 
UTF-8:

<!DOCTYPE html><title>æøå</title></html>

While for this - identical - page, it would default to the locale 
encoding, due to the use of ASCII-based character entities, which 
means that it does not detect any UTF-8-ish characters:

<!DOCTYPE html><title>&#xe6;&#xf8;&#xe5;</title></html>

A weird variant of the latter example is UTF-8-based data URIs, where 
all browsers that I could test (IE only supports data URIs in the 
@src attribute, including script@src) default to the locale encoding 
(apart from Mozilla Camino, which has character detection enabled by 
default):

data:text/html,<!DOCTYPE html><title>%C3%A6%C3%B8%C3%A5</title></html>

All three examples above should default to UTF-8 if the 'border on 
sane' approach were applied.
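
A small Python check of the first two cases (illustrative): the entity-only
variant contains no bytes at or above 0x80, so byte-level UTF-8 detection
has nothing to look at:

raw = "<!DOCTYPE html><title>æøå</title></html>".encode("utf-8")
ncr = b"<!DOCTYPE html><title>&#xe6;&#xf8;&#xe5;</title></html>"
print(any(b >= 0x80 for b in raw))  # True  -> detectable as UTF-8
print(any(b >= 0x80 for b in ncr))  # False -> falls back to the locale default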

 But trying to gain more statistical confidence
 about UTF-8ness than that would be bad for performance (either due to
 stalling stream processing or due to reloading).

So here you say that it is better to start presenting early, and 
eventually reload [I think] if, during the presentation, the encoding 
choice shows itself to be wrong, than it would be to investigate too 
much and be absolutely certain before starting to present the page.

Later, at Jan 3 00:50:26 PST 2012, you added:
 And it's worth noting that the above paragraph states a solution to
 the problem that is: How to make it possible to use UTF-8 without
 declaring it?

Indeed.

 Adding autodetection wouldn't actually force authors to use UTF-8, so
 the problem Faruk stated at the start of the thread (authors not using
 UTF-8 throughout systems that process user input) wouldn't be solved.

If we take that logic to its end, then it would not make sense for the 
validator to display an error when a page contains a form without being 
UTF-8 encoded, either. Because, after all, the backend/whatever could 
be non-UTF-8 based. The only way to solve that problem on those 
systems would be to send form content as character entities. (However, 
then too the form-based page should still be UTF-8 in the first place, 
in order to be able to take any content.)

Re: [whatwg] Default encoding to UTF-8?

2011-12-22 Thread Leif Halvard Silli
Henri Sivonen hsivonen Mon Dec 19 07:17:43 PST 2011
 On Sun, Dec 11, 2011 at 1:21 PM, Leif Halvard Silli wrote:

Sorry for my slow reply.

 It surprises me greatly that Gecko doesn't treat unicode as an alias
 for utf-16.
 
 Which must
 EITHER mean that many of these pages *are* UTF-16 encoded OR that their
 content is predominantly  US-ASCII and thus the artefacts of parsing
 UTF-8 pages (UTF-16 should be treated as UTF-8 when it isn't
 UTF-16) as WINDOWS-1252, do not affect users too much.
 
 It's unclear to me if you are talking about HTTP-level charset=UNICODE
 or charset=UNICODE in a meta. Is content labeled with charset=UNICODE
 BOMless?

Charset=UNICODE in meta, as generated by MS tools (Office or IE, eg.) 
seems to usually be BOM-full. But there are still enough occurrences 
of pages without BOM. I have found UTF-8 pages with the charset=unicode 
label in meta. But the few pages I found contained either a BOM or 
HTTP-level charset=utf-8. I have too little research material when it 
comes to UTF-8 pages with charset=unicode inside.

  (2) for the user tests you suggested in Mozilla bug 708995 (above),
 the presence of <meta charset=UNICODE> would trigger a need for Firefox
 users to select UTF-8 - unless the locale already defaults to UTF-8;
 
 Hmm. The HTML spec isn't too clear about when alias resolution
 happens, so I (incorrectly, I now think) mapped only UTF-16,
 UTF-16BE and UTF-16LE (ASCII-case-insensitive) to UTF-8 in meta
 without considering aliases at that point. Hixie, was alias resolution
 supposed to happen first? In Firefox, alias resolution happens after,
 so <meta charset=iso-10646-ucs-2> is ignored per the non-ASCII
 superset rule.

Waiting to hear what Hixie says ...

 While UTF-8 is possible to detect, I really don't want to take Firefox
 down the road where users who currently don't have to suffer page load
 restarts from heuristic detection have to start suffering them. (I
 think making incremental rendering any less incremental for locales
 that currently don't use a detector is not an acceptable solution for
 avoiding restarts. With English-language pages, the UTF-8ness might
 not be apparent from the first 1024 bytes.)

 FIRSTLY, HTML5:

 ]] 8.2.2.4 Changing the encoding while parsing
 [...] This might happen if the encoding sniffing algorithm described
 above failed to find an encoding, or if it found an encoding that was
 not the actual encoding of the file. [[

 Thus, trying to detect UTF-8 is second last step of the sniffing
 algorithm. If it, correctly, detects UTF-8, then, while the detection
 probably affects performance, detecting UTF-8 should not lead to a need
 for re-parsing the page?
 
 Let's consider, for simplicity, the locales for Western Europe and the
 Americas that default to Windows-1252 today. If browsers in these
 locales started doing UTF-8-only detection, they could either:
  1) Start the parse assuming Windows-1252 and reload if the detector 
 says UTF-8.

When the detector says UTF-8 - that is step 7 of the sniffing algorithm, 
no?
http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

  2) Start the parse assuming UTF-8 and reload as Windows-1252 if the
 detector says non-UTF-8.
 
 (Buffering the whole page is not an option, since it would break
 incremental loading.)
 
 Option #1 would be bad, because we'd see more and more reloading over
 time assuming that authors start using more and more UTF-8-enabled
 tools over time but don't go through the trouble of declaring UTF-8,
 since the pages already seem to work without declarations.

So the so-called badness is only a theory about what will happen - how 
the web will develop. As is, there is nothing particularly bad about 
starting out with UTF-8 as the assumption.

I think you are mistaken there: If parsers perform UTF-8 detection, 
then unlabelled pages will be detected, and no reparsing will happen. 
Not even an increase. You at least need to explain this negative spiral 
theory better before I buy it ... Step 7 will *not* lead to reparsing 
unless the default encoding is WINDOWS-1252. If the default encoding is 
UTF-8, then when step 7 detects UTF-8, it means that parsing 
can continue uninterrupted.

What we will instead see is that those using legacy encodings must be 
more clever in labelling their pages, or else they won't be detected. 

I am a bit baffled here: It sounds like you say that there will be bad 
consequences if browsers become more reliable ...

 Option #2 would be bad, because pages that didn't reload before would
 start reloading and possibly executing JS side effects twice.

#1 sounds least bad, since the only badness you describe is a theory 
about what this behaviour would lead to, w.r.t authors. 

 SECONDLY: If there is a UTF-8 silo - that leads to undeclared UTF-8
 pages, then it is the browsers *outside* that silo which eventually
 suffer (browsers that do default to UTF-8 do not need to perform UTF-8
 detection, I suppose - or what?). So then it is partly a matter of how
 large the silo is.

Re: [whatwg] Default encoding to UTF-8?

2011-12-19 Thread Henri Sivonen
On Sun, Dec 11, 2011 at 1:21 PM, Leif Halvard Silli
xn--mlform-...@xn--mlform-iua.no wrote:
 (which means
 *other-language* pages when the language of the localization doesn't
 have a pre-UTF-8 legacy).

 Do you have any concrete examples?

 The example I had in mind was Welsh.

 Logical candidate. What do you know about the Farsi and Arabic locales?

Nothing basically.

 I discovered that UNICODE is
 used as alias for UTF-16 in IE and Webkit.
...
 Seemingly, this has not affected Firefox users too much.

It surprises me greatly that Gecko doesn't treat unicode as an alias
for utf-16.

 Which must
 EITHER mean that many of these pages *are* UTF-16 encoded OR that their
 content is predominantly  US-ASCII and thus the artefacts of parsing
 UTF-8 pages (UTF-16 should be treated as UTF-8 when it isn't
 UTF-16) as WINDOWS-1252, do not affect users too much.

It's unclear to me if you are talking about HTTP-level charset=UNICODE
or charset=UNICODE in a meta. Is content labeled with charset=UNICODE
BOMless?

  (2) for the user tests you suggested in Mozilla bug 708995 (above),
 the presence of <meta charset=UNICODE> would trigger a need for Firefox
 users to select UTF-8 - unless the locale already defaults to UTF-8;

Hmm. The HTML spec isn't too clear about when alias resolution
happens, so I (incorrectly, I now think) mapped only UTF-16,
UTF-16BE and UTF-16LE (ASCII-case-insensitive) to UTF-8 in meta
without considering aliases at that point. Hixie, was alias resolution
supposed to happen first? In Firefox, alias resolution happens after,
so <meta charset=iso-10646-ucs-2> is ignored per the non-ASCII
superset rule.

 While UTF-8 is possible to detect, I really don't want to take Firefox
 down the road where users who currently don't have to suffer page load
 restarts from heuristic detection have to start suffering them. (I
 think making incremental rendering any less incremental for locales
 that currently don't use a detector is not an acceptable solution for
 avoiding restarts. With English-language pages, the UTF-8ness might
 not be apparent from the first 1024 bytes.)

 FIRSTLY, HTML5:

 ]] 8.2.2.4 Changing the encoding while parsing
 [...] This might happen if the encoding sniffing algorithm described
 above failed to find an encoding, or if it found an encoding that was
 not the actual encoding of the file. [[

 Thus, trying to detect UTF-8 is second last step of the sniffing
 algorithm. If it, correctly, detects UTF-8, then, while the detection
 probably affects performance, detecting UTF-8 should not lead to a need
 for re-parsing the page?

Let's consider, for simplicity, the locales for Western Europe and the
Americas that default to Windows-1252 today. If browsers in these
locales started doing UTF-8-only detection, they could either:
 1) Start the parse assuming Windows-1252 and reload if the detector says UTF-8.
 2) Start the parse assuming UTF-8 and reload as Windows-1252 if the
detector says non-UTF-8.

(Buffering the whole page is not an option, since it would break
incremental loading.)

Option #1 would be bad, because we'd see more and more reloading over
time assuming that authors start using more and more UTF-8-enabled
tools over time but don't go through the trouble of declaring UTF-8,
since the pages already seem to work without declarations.

Option #2 would be bad, because pages that didn't reload before would
start reloading and possibly executing JS side effects twice.

 SECONDLY: If there is a UTF-8 silo - that leads to undeclared UTF-8
 pages, then it is the browsers *outside* that silo which eventually
 suffer (browsers that do default to UTF-8 do not need to perform UTF-8
 detection, I suppose - or what?). So then it is partly a matter of how
 large the silo is.

 Regardless, we must consider: The alternative to undeclared UTF-8 pages
 would be to be undeclared legacy encoding pages, roughly speaking.
 Which the browsers outside the silo then would have to detect. And
 which would be more *demand* to detect than simply detecting UTF-8.

Well, so far (except for sv-SE (but no longer) and zh-TW), Firefox has
not *by default* done cross-silo detection and has managed to get
non-trivial market share, so it's not a given that browsers from
outside a legacy silo *have to* detect.

 However, what you had in mind was the change of the default encoding for
 a particular silo from legacy encoding to UTF-8. This, I agree, would
 lead to some pages being treated as UTF-8 - to begin with. But when the
 browser detects that this is wrong, it would have to switch to -
 probably - the old default - the legacy encoding.

 However, why would it switch *from* UTF-8 if UTF-8 is the default? We
 must keep the problem in mind: For the siloed browser, UTF-8 will be
 its fall-back encoding.

Doesn't the first of these two paragraphs answer the question posed in
the second one?

 It's rather counterintuitive that the persistent autodetection
 setting is in the same menu as the one-off override.

 You talk about 

Re: [whatwg] Default encoding to UTF-8?

2011-12-11 Thread Leif Halvard Silli
Leif Halvard Silli Sun Dec 11 03:21:40 PST 2011

 W.r.t. iframe, then the big Norwegian newspaper Dagbladet.no is 
 declared ISO-8859-1 encoded and it includes at least one ads iframe that 
  ...
 * Let's say that I *kept* ISO-8859-1 as default encoding, but instead 
 enabled the Universal detector. The frame then works.
 * But if I make the frame page very short, 10 * the letter ø as 
 content, then the Universal detector fails - on a test on my own 
 computer, it guesses the page to be Cyrillic rather than Norwegian.
 * What's the problem? The Universal detector is too greedy - it tries 
 to fix more problems than I have. I only want it to guess on UTF-8. 
 And if it doesn't detect UTF-8, then it should fall back to the locale 
 default (including fall back to the encoding of the parent frame).

The above illustrates that the current charset-detection solutions are 
starting to get old: They are not geared and optimized towards UTF-8 as 
the firmly recommended and - in principle - anticipated default.

The above may also catch a real problem with switching to UTF-8: that 
one may need to embed pages which do not use UTF-8: If one could trust 
UAs to attempt UTF-8 detection (but not Universal detection) before 
defaulting, then it would become virtually risk-free to switch a page to 
UTF-8, even if it contains iframe pages. Not?

Leif H Silli

Re: [whatwg] Default encoding to UTF-8?

2011-12-09 Thread Henri Sivonen
On Fri, Dec 9, 2011 at 12:33 AM, Leif Halvard Silli
xn--mlform-...@xn--mlform-iua.no wrote:
 Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
 These localizations are nevertheless live tests. If we want to move
 more firmly in the direction of UTF-8, one could ask users of those
 'live tests' about their experience.

Filed https://bugzilla.mozilla.org/show_bug.cgi?id=708995

 (which means
 *other-language* pages when the language of the localization doesn't
 have a pre-UTF-8 legacy).

 Do you have any concrete examples?

The example I had in mind was Welsh.

 And are there user complaints?

Not that I know of, but I'm not part of a feedback loop if there even
is a feedback loop here.

 The Serb localization uses UTF-8. The Croat uses Win-1252, but only on
 Windows and Mac: On Linux it appears to use UTF-8, if I read the HG
 repository correctly.

OS-dependent differences are *very* suspicious. :-(

 I think that defaulting to UTF-8 is always a bug, because at the time
 these localizations were launched, there should have been no unlabeled
 UTF-8 legacy, because up until these locales were launched, no
 browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
 UTF-8 is harmful, because it makes it possible for locale-siloed
 unlabeled UTF-8 content to come into existence

 The current legacy encodings nevertheless create siloed pages already.
 I'm also not sure that it would be a problem with such a UTF-8 silo:
 UTF-8 is possible to detect, for browsers - Chrome seems to perform
 more such detection than other browsers.

While UTF-8 is possible to detect, I really don't want to take Firefox
down the road where users who currently don't have to suffer page load
restarts from heuristic detection have to start suffering them. (I
think making incremental rendering any less incremental for locales
that currently don't use a detector is not an acceptable solution for
avoiding restarts. With English-language pages, the UTF-8ness might
not be apparent from the first 1024 bytes.)

 In another message you suggested I 'lobby' against authoring tools. OK.
 But the browser is also an authoring tool.

In what sense?

 So how can we have authors
 output UTF-8, by default, without changing the parsing default?

Changing the default is an XML-like solution: creating breakage for
users (who view legacy pages) in order to change author behavior.

To the extent a browser is a tool Web authors use to test stuff, it's
possible to add various whining to console without breaking legacy
sites for users. See
https://bugzilla.mozilla.org/show_bug.cgi?id=672453
https://bugzilla.mozilla.org/show_bug.cgi?id=708620

 Btw: In Firefox, then in one sense, it is impossible to disable
 automatic character detection: In Firefox, overriding of the encoding
 only lasts until the next reload.

A persistent setting for changing the fallback default is in the
Advanced subdialog of the font prefs in the Content preference
pane. It's rather counterintuitive that the persistent autodetection
setting is in the same menu as the one-off override.

As for heuristic detection based on the bytes of the page, the only
heuristic that can't be disabled is the heuristic for detecting
BOMless UTF-16 that encodes Basic Latin only. (Some Indian bank was
believed to have been giving that sort of files to their customers and
it worked in pre-HTML5 browsers that silently discarded all zero
bytes prior to tokenization.) The Cyrillic and CJK detection
heuristics can be turned on and off by the user.

Within an origin, Firefox considers the parent frame and the previous
document in the navigation history as sources of encoding guesses.
That behavior is not user-configurable to my knowledge.

Firefox also remembers the encoding from previous visits as long as
Firefox otherwise has the page in cache. So for testing, it's
necessary to make Firefox forget about previous visits to the test
case.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Default encoding to UTF-8?

2011-12-08 Thread Leif Halvard Silli
Henri Sivonen Tue Dec 6 23:45:11 PST 2011:
 On Mon, Dec 5, 2011 at 7:42 PM, Leif Halvard Silli wrote:

 Mozilla grants localizers a lot of latitude here. The defaults you see
 are not carefully chosen by a committee of encoding strategists doing
 whole-Web optimization at Mozilla.

We could use such a committee for the Web!

 They are chosen by individual
 localizers. Looking at which locales default to UTF-8, I think the
 most probable explanation is that the localizers mistakenly tried to
 pick an encoding that fits the language of the localization instead of
 picking an encoding that's the most successful at decoding unlabeled
 pages most likely read by users of the localization

These localizations are nevertheless live tests. If we want to move 
more firmly in the direction of UTF-8, one could ask users of those 
'live tests' about their experience.

 (which means
 *other-language* pages when the language of the localization doesn't
 have a pre-UTF-8 legacy).

Do you have any concrete examples? And are there user complaints?

The Serb localization uses UTF-8. The Croat uses Win-1252, but only on 
Windows and Mac: On Linux it appears to use UTF-8, if I read the HG 
repository correctly. As for Croat and Windows-1252, then it does not 
even support the Croat alphabet (in full) - I am thinking of the digraphs. 
But I'm not sure about the pre-UTF-8 legacy for Croatian.

Some language communities in Russia have a similar minority situation 
as Serb Cyrillic, only that their minority script is Latin: They use 
Cyrillic but they may also use Latin. But in Russia, Cyrillic 
dominates. Hence it seems to be the case - according to my earlier 
findings - that those few letters that, per each language, do not occur 
in Windows-1251, are inserted as NCRs (that is: when UTF-8 is not used). 
That way, Windows-1251 can be used for Latin with non-ASCII inside. But 
given that Croat defaults to Windows-1252, they could in theory just use 
NCRs too ...

Btw, for Safari on Mac, I'm unable to see any effect of switching 
locale: Always Win-1252 (Latin) - it used to have an effect before ... But 
maybe there is a parameter I'm unaware of - like Apple's knowledge of 
where in the world I live ...

 I think that defaulting to UTF-8 is always a bug, because at the time
 these localizations were launched, there should have been no unlabeled
 UTF-8 legacy, because up until these locales were launched, no
 browsers defaulted to UTF-8 (broadly speaking). I think defaulting to
 UTF-8 is harmful, because it makes it possible for locale-siloed
 unlabeled UTF-8 content to come into existence

The current legacy encodings nevertheless create siloed pages already. 
I'm also not sure that it would be a problem with such a UTF-8 silo: 
UTF-8 is possible to detect, for browsers - Chrome seems to perform 
more such detection than other browsers.

Today, perhaps especially for English users, it happens all the time 
that a page - without notice - defaults with regard to encoding - and 
this causes the browser - when used as an authoring tool - to default to 
Windows-1252: http://twitter.com/#!/komputist/status/144834229610614784 
(I suppose he used that browser-based spec authoring tool that is in 
development.) 

In another message you suggested I 'lobby' against authoring tools. OK. 
But the browser is also an authoring tool. So how can we have authors 
output UTF-8, by default, without changing the parsing default?

 (instead of guiding all Web
 authors always to declare their use of UTF-8 so that the content works
 with all browser locale configurations).

One must guide authors to do this regardless.

 I have tried to lobby internally at Mozilla for stricter localizer
 oversight here but have failed. (I'm particularly worried about
 localizers turning the heuristic detector on by default for their
 locale when it's not absolutely needed, because that's actually
 performance-sensitive and less likely to be corrected by the user.
 Therefore, turning the heuristic detector on may do performance
 reputation damage. )

W.r.t. heuristic detector: Testing the default encoding behaviour for 
Firefox was difficult. But in the end I understood that I must delete 
the cached version of the Profile folder - only then would the 
encodings 'fall back' properly. But before I got that far, I tried 
with e.g. the Russian version of Firefox, and discovered that it 
enabled the encoding heuristics: Thus it worked! Had it not done that, 
then it would instead have used Windows-1252 as the default ... So you 
perhaps need to be careful before telling them to disable heuristics ...

Btw: In Firefox, then in one sense, it is impossible to disable 
automatic character detection: In Firefox, overriding of the encoding 
only lasts until the next reload. However, I just discovered that in 
Opera, this is not the case: If you select Windows-1252 in Opera, then 
it will always - but only in the current tab - be Windows-1252, even if 
there is a BOM and everything. In a way 

Re: [whatwg] Default encoding to UTF-8?

2011-12-07 Thread Henri Sivonen
On Tue, Dec 6, 2011 at 2:10 AM, Kornel Lesiński kor...@geekhood.net wrote:
 On Fri, 02 Dec 2011 15:50:31 -, Henri Sivonen hsivo...@iki.fi wrote:

 That compatibility mode already exists: It's the default mode--just
 like the quirks mode is the default for pages that don't have a
 doctype. You opt out of the quirks mode by saying <!DOCTYPE html>. You
 opt out of the encoding compatibility mode by saying <meta
 charset=utf-8>.


 Could <!DOCTYPE html> be an opt-in to default UTF-8 encoding?

 It would be nice to minimize number of declarations a page needs to include.

I think that's a bad idea. We already have *three*
backwards-compatible ways to opt into UTF-8. <!DOCTYPE html> isn't one
of them. Moreover, I think it's a mistake to bundle a lot of unrelated
things into one mode switch instead of having legacy-compatible
defaults and having granular ways to opt into legacy-incompatible
behaviors. (That is, I think, in retrospect, it's bad that we have a
doctype-triggered standards mode with legacy-incompatible CSS defaults
instead of having legacy-compatible CSS defaults and CSS properties
for opting into different behaviors.)

If you want to minimize the declarations, you can put the UTF-8 BOM
followed by <!DOCTYPE html> at the start of the file.
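
For instance, a minimal Python sketch of producing such a file (the file
name is arbitrary):

with open("page.html", "wb") as f:
    f.write(b"\xef\xbb\xbf")   # UTF-8 BOM
    f.write("<!DOCTYPE html><title>æøå</title>".encode("utf-8"))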

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Default encoding to UTF-8?

2011-12-06 Thread Jukka K. Korpela

2011-12-06 6:54, Leif Halvard Silli wrote:


Yeah, it would be a pity if it had already become a widespread
cargo-cult to - all at once - use HTML5 doctype without using UTF-8
*and* without using some encoding declaration *and* thus effectively
relying on the default locale encoding ... Who does have a data corpus?


I think we would need to ask search engine developers about that, but 
what is this proposed change to defaults supposed to achieve? It would 
break any old page that does not specify the encoding, as soon as 
the doctype is changed to <!doctype html> or this doctype is added to a 
page that lacked a doctype.


Since <!doctype html> is the simplest way to put browsers to standards 
mode, this would punish authors who have realized that their page works 
better in standards mode but are unaware of a completely different and 
fairly complex problem. (Basic character encoding issues are of course 
not that complex to you and me or most people around here; but most 
authors are more or less confused with them, and I don't think we should 
add to the confusion.)


There's little point in changing the specs to say something very 
different from what previous HTML specs have said and from actual 
browser behavior. If the purpose is to make things more exactly defined 
(a fixed encoding vs. implementation-defined), then I think such 
exactness is a luxury we cannot afford. Things would be all different if 
we were designing a document format from scratch, with no existing 
implementations and no existing usage. If the purpose is UTF-8 
evangelism, then it would be just the kind of evangelism that produces 
angry people, not converts.


If there's something that should be added to or modified in the 
algorithm for determining character encoding, then I'd say it's error 
processing. I mean user agent behavior when it detects, after running 
the algorithm, when processing the document data, that there is a 
mismatch between them. That is, that the data contains octets or octet 
sequences that are not allowed in the encoding or that denote 
noncharacters. Such errors are naturally detected when the user agent 
processes the octets; the question is what the browser should do then.


When data that is actually in ISO-8859-1 or some similar encoding has 
been mislabeled as UTF-8 encoded, then, if the data contains octets 
outside the ASCII range, character-level errors are likely to occur. Many 
ISO-8859-1 octets are just not possible in UTF-8 data. The converse 
error may also cause character-level errors. And these are not uncommon 
situations - they seem to occur increasingly often, partly due to cargo 
cult use of UTF-8 (when it means declaring UTF-8 but not actually 
using it, or vice versa), partly due to increased use of UTF-8 combined 
with ISO-8859-1 encoded data creeping in from somewhere into UTF-8 
encoded data.


From the user's point of view, the character-level errors currently 
result in some gibberish (e.g., some odd box appearing instead of a 
character, in one place) or in a total mess (e.g. a large number of non-ASCII 
characters displayed all wrong). In either case, I think an error should 
be signalled to the user, together with
a) automatically trying another encoding, such as the locale default 
encoding instead of UTF-8 or UTF-8 instead of anything else
b) suggesting to the user that he should try to view the page using some 
other encoding, possibly with a menu of encodings offered as part of the 
error explanation

c) a combination of the above.

Although there are good reasons why browsers usually don't give error 
messages, this would be a special case. It's about the primary 
interpretation of the data in the document and about a situation where 
some data has no interpretation in the assumed encoding - but usually 
has an interpretation in some other encoding.


The current Character encoding overrides rules are questionable 
because they often mask out data errors that would have helped to detect 
problems that can be solved constructively. For example, if data labeled 
as ISO-8859-1 contains an octet in the 80...9F range, then it may well 
be the case that the data is actually windows-1252 encoded and the 
override helps everyone. But it may also be the case that the data is 
in a different encoding and that the override therefore results in 
gibberish shown to the user, with no hint of the cause of the problem. 
It would therefore be better to signal a problem to the user, display 
the page using the windows-1252 encoding but with some instruction or 
hint on changing the encoding. And a browser should in this process 
really analyze whether the data can be windows-1252 encoded data that 
contains only characters permitted in HTML.
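
A rough Python sketch of that kind of analysis (illustrative only; Python's
strict cp1252 decoder stands in for the check, and it only flags the bytes
windows-1252 leaves undefined, not everything HTML disallows):

def analyze(data: bytes) -> str:
    if not any(0x80 <= b <= 0x9F for b in data):
        return "stays within plain ISO-8859-1"
    try:
        data.decode("cp1252")   # strict: fails on the bytes windows-1252 leaves undefined
        return "decodes as windows-1252; the override is plausible"
    except UnicodeDecodeError:
        return "not valid windows-1252 either; warn the user"

print(analyze(b"caf\xe9"))         # stays within plain ISO-8859-1
print(analyze(b"\x93quoted\x94"))  # decodes as windows-1252 (curly quotes)
print(analyze(b"\x81 oops"))       # not valid windows-1252 either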


Yucca


Re: [whatwg] Default encoding to UTF-8?

2011-12-06 Thread NARUSE, Yui
(2011/12/06 17:39), Jukka K. Korpela wrote:
 2011-12-06 6:54, Leif Halvard Silli wrote:
 
 Yeah, it would be a pity if it had already become a widespread
 cargo-cult to - all at once - use HTML5 doctype without using UTF-8
 *and* without using some encoding declaration *and* thus effectively
 relying on the default locale encoding ... Who does have a data corpus?

I found it: http://rink77.web.fc2.com/html/metatagu.html
It uses the HTML5 doctype, does not declare an encoding, and its encoding is Shift_JIS,
the default encoding of the Japanese locale.

 Since <!doctype html> is the simplest way to put browsers to standards 
 mode, this would punish authors who have realized that their page works 
 better in standards mode but are unaware of a completely different and 
 fairly complex problem. (Basic character encoding issues are of course not 
 that complex to you and me or most people around here; but most authors are 
 more or less confused with them, and I don't think we should add to the 
 confusion.)

I don't think there is a page that works better in standards mode than in the *current* 
loose mode.

 There's little point in changing the specs to say something very different 
 from what previous HTML specs have said and from actual browser behavior. If 
 the purpose is to make things more exactly defined (a fixed encoding vs. 
 implementation-defined), then I think such exactness is a luxury we cannot 
 afford. Things would be all different if we were designing a document format 
 from scratch, with no existing implementations and no existing usage. If the 
 purpose is UTF-8 evangelism, then it would be just the kind of evangelism 
 that produces angry people, not converts.

Agreed; if we were designing a new spec, there would be no reason to choose anything other than UTF-8.
But HTML has a long history and much content.
We already have HTML*5* pages which don't have an encoding declaration.

 If there's something that should be added to or modified in the algorithm for 
 determining character encoding, then I'd say it's error processing. I mean 
 user agent behavior when it detects, after running the algorithm, when 
 processing the document data, that there is a mismatch between them. That is, 
 that the data contains octets or octet sequences that are not allowed in the 
 encoding or that denote noncharacters. Such errors are naturally detected 
 when the user agent processes the octets; the question is what the browser 
 should do then.

Current implementations replace such an invalid octet with a replacement 
character.
Or some implementations scan almost the whole page and use an encoding
with which all octets in the page are valid.
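
For illustration (Python's decoder standing in for a browser's):

print(b"caf\xe9 au lait".decode("utf-8", errors="replace"))  # 0xE9 is invalid UTF-8 here and becomes U+FFFD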

 When data that is actually in ISO-8859-1 or some similar encoding has been 
 mislabeled as UTF-8 encoded, then, if the data contains octets outside 
 the ASCII, character-level errors are likely to occur. Many ISO-8859-1 octets 
 are just not possible in UTF-8 data. The converse error may also cause 
 character-level errors. And these are not uncommon situations - they seem 
 occur increasingly often, partly due to cargo cult use of UTF-8 (when it 
 means declaring UTF-8 but not actually using it, or vice versa), partly due 
 increased use of UTF-8 combined with ISO-8859-1 encoded data creeping in from 
 somewhere into UTF-8 encoded data.

In such a case, the page should fail to display in the author's environment.

 From the user's point of view, the character-level errors currently result is 
 some gibberish (e.g., some odd box appearing instead of a character, in one 
 place) or in total mess (e.g. a large number non-ASCII characters displayed 
 all wrong). In either case, I think an error should be signalled to the user, 
 together with
 a) automatically trying another encoding, such as the locale default encoding 
 instead of UTF-8 or UTF-8 instead of anything else
 b) suggesting to the user that he should try to view the page using some 
 other encoding, possibly with a menu of encodings offered as part of the 
 error explanation
 c) a combination of the above.

This presumes that a user knows the correct encoding.
But do European people really know the correct encoding of ISO-8859-* pages?
I, as a Japanese speaker, imagine that it is hard to distinguish an ISO-8859-1 page from an 
ISO-8859-2 page.

 Although there are good reasons why browsers usually don't give error 
 messages, this would be a special case. It's about the primary interpretation 
 of the data in the document and about a situation where some data has no 
 interpretation in the assumed encoding - but usually has an interpretation in 
 some other encoding.

Some browsers alert about scripting issues.
Why can they not alert about an encoding issue?

 The current Character encoding overrides rules are questionable because 
 they often mask out data errors that would have helped to detect problems 
 that can be solved constructively. For example, if data labeled as ISO-8859-1 
 contains an octet in the 80...9F range, then it may well be the case that the 
 data is actually 

Re: [whatwg] Default encoding to UTF-8?

2011-12-06 Thread Jukka K. Korpela

2011-12-06 15:59, NARUSE, Yui wrote:


(2011/12/06 17:39), Jukka K. Korpela wrote:

2011-12-06 6:54, Leif Halvard Silli wrote:


Yeah, it would be a pity if it had already become a widespread
cargo-cult to - all at once - use HTML5 doctype without using UTF-8
*and* without using some encoding declaration *and* thus effectively
relying on the default locale encoding ... Who does have a data corpus?


I found it: http://rink77.web.fc2.com/html/metatagu.html


I'm not sure of the intended purpose of that demo page, but it seems to 
illustrate my point.



It uses HTML5 doctype and not declare encoding and its encoding is Shift_JIS,
the default encoding of Japanese locale.


My Firefox uses the ISO-8859-1 encoding, my IE the windows-1252 
encoding, resulting in a mess of course. But the point is that both 
interpretations mean data errors at the character level - even seen as 
windows-1252, it contains bytes with no assigned meaning (e.g., 0x81 is 
UNDEFINED).



Current implementations replaces such an invalid octet with a replacement 
character.


No, it varies by implementation.


When data that is actually in ISO-8859-1 or some similar encoding has been mislabeled as 
UTF-8  encoded, then, if the data contains octets outside the ASCII, character-level 
errors are likely to occur. Many ISO-8859-1 octets are just not possible in UTF-8 data. 
The converse error may also cause character-level errors. And these are not uncommon 
situations - they seem occur increasingly often, partly due to cargo cult use of 
UTF-8 (when it means declaring UTF-8 but not actually using it, or vice versa), 
partly due increased use of UTF-8 combined with ISO-8859-1 encoded data creeping in from 
somewhere into UTF-8 encoded data.


In such a case, the page should fail to display in the author's environment.


An authoring tool should surely indicate the problem. But what should 
user agents do when they face such documents and need to do something 
with them?



 From the user's point of view, the character-level errors currently result is 
some gibberish (e.g., some odd box appearing instead of a character, in one 
place) or in total mess (e.g. a large number non-ASCII characters displayed all 
wrong). In either case, I think an error should be signalled to the user, 
together with
a) automatically trying another encoding, such as the locale default encoding 
instead of UTF-8 or UTF-8 instead of anything else
b) suggesting to the user that he should try to view the page using some other 
encoding, possibly with a menu of encodings offered as part of the error 
explanation
c) a combination of the above.


This presumes that a user knows the correct encoding.


Alternative b) means that the user can try some encodings. A user agent 
could give a reasonable list of options.


Consider the example document mentioned. When viewed in a Western 
environment, it probably looks all gibberish. Alternative a) would 
probably not help, but alternative b) would have some chances. If the 
user has some reason to suspect that the page might be in Japanese, he 
would probably try the Japanese encodings in the browser's list of 
encodings, and this would make the document readable after a try or two.



I, as a Japanese speaker, imagine that it is hard to distinguish an ISO-8859-1 page from an 
ISO-8859-2 page.


Yes, but the idea isn't really meant to apply to such cases, as there is 
no way _at the character encoding level_ to recognize 
ISO-8859-1 mislabeled as ISO-8859-2 or vice versa.



Some browsers alert about scripting issues.
Why can they not alert about an encoding issue?


Surely they could, though I was not thinking of an alert in a popup sense - 
rather, a red error indicator somewhere. There would be many more 
reasons to signal encoding issues than to signal scripting issues, as we 
know that web pages generally contain loads of client-side scripting 
errors that do not actually affect page rendering or functionality.



The current Character encoding overrides rules are questionable because they often mask out data 
errors that would have helped to detect problems that can be solved constructively. For example, if data 
labeled as ISO-8859-1 contains an octet in the 80...9F range, then it may well be the case that the data is 
actually windows-1252 encoded and the override helps everyone. But it may also be the case that 
the data is in a different encoding and that the override therefore results in gibberish shown to 
the user, with no hint of the cause of the problem.


I think such a case doesn't exist.
In the character encoding overrides, a superset overrides a standard set.


Technically, not quite so (e.g., in ISO-8859-1, 0x81 is U+0081, a 
control character that is not allowed in HTML - I suppose, though I 
cannot really find a statement on this in HTML5 - whereas in 
windows-1252, it is undefined).


More importantly my point was about errors in data, resulting e.g. from 
a faulty code conversion or some malfunctioning software that has 
produced, 

Re: [whatwg] Default encoding to UTF-8?

2011-12-06 Thread Jukka K. Korpela

2011-12-06 22:58, Leif Halvard Silli wrote:


There is now a bug, and the editor says the outcome depends on a
browser vendor to ship it:
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15076

Jukka K. Korpela Tue Dec 6 00:39:45 PST 2011


what is this proposed change to defaults supposed to achieve. […]


I'd say the same as in XML: UTF-8 as a reliable, common default.


The bug was created, and the argument given was:
It would be nice to minimize number of declarations a page needs to 
include.


That is, author convenience - so that authors could work sloppily and 
produce documents that could fail on user agents that haven't 
implemented this change.


This sounds more absurd than I can describe.

XML was created as a new data format; it was an entirely different issue.


If there's something that should be added to or modified in the
algorithm for determining character encoding, then I'd say it's error
processing. I mean user agent behavior when it detects, [...]


There is already an (optional) detection step in the algorithm - but UA
treat that step differently, it seems.


I'm afraid I can't find it - I mean the treatment of a document for 
which some encoding has been deduced (say, directly from HTTP headers) 
and which then turns out to violate the rules of the encoding.


Yucca




Re: [whatwg] Default encoding to UTF-8?

2011-12-06 Thread Henri Sivonen
On Mon, Dec 5, 2011 at 8:55 PM, Leif Halvard Silli
xn--mlform-...@xn--mlform-iua.no wrote:
 When you say 'requires': Of course, HTML5 recommends that you declare
 the encoding (via HTTP/higher protocol, via the BOM 'sideshow' or via
 meta charset=UTF-8). I just now also discovered that Validator.nu
 issues an error message if it does not find any of of those *and* the
 document contains non-ASCII. (I don't know, however, whether this error
 message is just something Henri added at his own discretion - it would
 be nice to have it literally in the spec too.)

I believe I was implementing exactly what the spec said at the time I
implemented that behavior of Validator.nu. I'm particularly convinced
that I was following the spec, because I think it's not the optimal
behavior. I think pages that don't declare their encoding should
always be non-conforming even if they only contain ASCII bytes,
because that way templates created by English-oriented (or lorem ipsum
-oriented) authors would be caught as non-conforming before non-ASCII
text gets filled into them later. Hixie disagreed.

 HTML5 says that validators *may* issue a warning if UTF-8 is *not* the
 encoding. But so far, validator.nu has not picked that up.

Maybe it should. However, non-UTF-8 pages that label their encoding,
that use one of the encodings that we won't be able to get rid of
anyway and that don't contain forms aren't actively harmful. (I'd
argue that they are *less* harmful than unlabeled UTF-8 pages.)
Non-UTF-8 is harmful in form submission. It would be more focused to
make the validator complain about labeled non-UTF-8 if the page
contains a form. Also, it could be useful to make Firefox whine to
console when a form is submitted in non-UTF-8 and when an HTML page
has no encoding label. (I'd much rather implement all these than
implement breaking changes to how Firefox processes legacy content.)

 We should also lobby for authoring tools (as recommended by HTML5) to
 default their output to UTF-8 and make sure the encoding is declared.

 HTML5 already says: Authoring tools should default to using UTF-8 for
 newly-created documents. [RFC3629]
 http://dev.w3.org/html5/spec/semantics.html#charset

I think focusing your efforts on lobbying authoring tool vendors to
withhold the ability to save pages in non-UTF-8 encodings would be a
better way to promote UTF-8 than lobbying browser vendors to change
the defaults in ways that'd break locale-siloed Existing Content.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
L. David Baron on Wed Nov 30 18:29:31 PST 2011:
 On Wednesday 2011-11-30 15:28 -0800, Faruk Ates wrote:
 My understanding is that all browsers* default to Western Latin
 (ISO-8859-1) encoding by default (for Western-world
 downloads/OSes) due to legacy content on the web. But how relevant
 is that still today? Has any browser done any recent research into
 the need for this?
 
 The default varies by localization (and within that potentially by
 platform), and unfortunately that variation does matter.  You can
 see Firefox's defaults here:
 http://mxr.mozilla.org/l10n-mozilla-beta/search?string=intl.charset.default
 (The localization and platform are part of the filename.)

Last I checked, some of those locales defaulted to UTF-8. (And HTML5 
defines it the same.) So how is that possible? Don't users of those 
locales travel as much as you do? Or do we consider English locale 
users as more important? Something is broken in the logic here!

 I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago
 (by changing the intl.charset.default preference), and I do see a
 decent amount of broken content as a result (maybe I encounter a new
 broken page once a week? -- though substantially more often if I'm
 looking at non-English pages because of travel).

What kind of trouble are you actually describing here? You are 
describing a problem with using UTF-8 for *your locale*. What is your 
locale? It is probably English. Or do you consider your locale to be 
'the Western world locale'? It sounds like *that* is what Anne has in 
mind when he brings in Dutch: 
http://blog.whatwg.org/weekly-encoding-woes (Quite often it sounds as 
if some see Latin-1 - or Windows-1252 as we now should say - as a 
'super default' rather than a locale default. If that is the case, that 
it is a super default, then we should also spec it like that! Until 
further, I'll treat Latin-1 as it is specced: As a default for certain 
locales.)

Since it is a locale problem, we need to understand which locale you 
have - and/or which locale you - and other debaters - think they have. 
Faruk probably uses a Spanish locale - right? So the two of you are 
not speaking from the same context. 

However, you also say that your problem is not so much related to pages 
written for *your* locale as it is related for pages written for users 
of *other* locales. So how many times per year do Dutch, Spanish or 
Norwegian - and other non-English - pages create trouble for 
you, as an English locale user? I am making an assumption: Almost never. 
You don't read those languages, do you? 

This is also an expectation thing: If you visit a Russian page in a 
legacy Cyrillic encoding, and get mojibake because your browser 
defaults to Latin-1, then what does it matter to you whether your 
browser defaults to Latin-1 or UTF-8? Answer: Nothing. 

 I'm wondering if it might not be good to start encouraging
 defaulting to UTF-8, and only fallback to Western Latin if it is
 detected that the content is very old / served by old
 infrastructure or servers, etc. And of course if the content is
 served with an explicit encoding of Western Latin.
 
 The more complex the rules, the harder they are for authors to
 understand / debug.  I wouldn't want to create rules like those.

Agree that that particular idea is probably not the best.
 
 I would, however, like to see movement towards defaulting to UTF-8:
 the current situation makes the Web less world-wide because pages
 that work for one user don't work for another.
 
 I'm just not quite sure how to get from here to there, though, since
 such changes are likely to make users experience broken content.

I think we should 'attack' the dominating locale first: The English 
locale, in its different incarnations (Australian, American, UK). Thus, 
we should turn things on the head: English users should start to expect 
UTF-8 to be used. Because, as English users, you are more used to 
'mojibake' than the rest of us are: Whenever you see it, you 'know' 
that it is because it is a foreign language you are reading. It is we, 
the users of non-English locales, that need the default-to-legacy 
encoding behavior the most. Or, please, explain to us when and where it 
is important for English language users, living in their own native 
lands so to speak, that their browser defaults to Latin-1 so that 
they can correctly read English language pages?

If the English locales start defaulting to UTF-8, then little by 
little, the same expectation etc will start spreading to the other 
locales as well, not least because the 'geeks' of each locale will tend 
to see the English locale as a super default - and they might also use 
the US English locale of their OS and/or browser. We should not 
consider the needs of geeks - they will follow (read: lead) the way, so 
the fact that *they* may see mojibake, should not be a concern.

See? We would have a plan. Or what do you think? Of course, we - or 
rather: the 

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Sergiusz Wolicki
 (And HTML5 defines it the same.)

No. As far as I understand, HTML5 defines US-ASCII to be the default and
requires that any other encoding is explicitly declared. I do like this
approach.

We should also lobby for authoring tools (as recommended by HTML5) to
default their output to UTF-8 and make sure the encoding is declared.  As
so many pages, supposedly (I have not researched this), use the incorrect
encoding, it makes no sense to try to clean this mess by messing with
existing defaults. It may fix some pages and break others. Browsers have
the ability to override an incorrect encoding and this is a reasonable
workaround.


-- Sergiusz



Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
 (And HTML5 defines it the same.)
 
 No. As far as I understand, HTML5 defines US-ASCII to be the default and
 requires that any other encoding is explicitly declared. I do like this
 approach.

We are here discussing the default *user agent behaviour* - we are not 
specifically discussing how web pages should be authored.

For user agents, please be aware that HTML5 maintains a table of 
suggested default encodings: 
http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding

When you say 'requires': Of course, HTML5 recommends that you declare 
the encoding (via HTTP/higher protocol, via the BOM 'sideshow' or via 
meta charset=UTF-8). I just now also discovered that Validator.nu 
issues an error message if it does not find any of those *and* the 
document contains non-ASCII. (I don't know, however, whether this error 
message is just something Henri added at his own discretion - it would 
be nice to have it literally in the spec too.)

(The problem is of course that many English pages may eventually need 
the whole Unicode repertoire even if they only contain US-ASCII at the start.)

HTML5 says that validators *may* issue a warning if UTF-8 is *not* the 
encoding. But so far, validator.nu has not picked that up.
 
 We should also lobby for authoring tools (as recommended by HTML5) to
 default their output to UTF-8 and make sure the encoding is declared.

HTML5 already says: Authoring tools should default to using UTF-8 for 
newly-created documents. [RFC3629] 
http://dev.w3.org/html5/spec/semantics.html#charset

 As
 so many pages, supposedly (I have not researched this), use the incorrect
 encoding, it makes no sense to try to clean this mess by messing with
 existing defaults. It may fix some pages and break others. Browsers have
 the ability to override an incorrect encoding and this is a reasonable
 workaround.

Do you use an English locale computer? If you do, without being a native 
English speaker, then you are some kind of geek ... Why can't you work 
around the trouble - as you are used to doing anyway?

Starting a switch to UTF-8 as the default UA encoding for English 
locale users should *only* affect how English locale users experience 
languages which *both* need non-ASCII *and* historically have been 
using Windows-1252 as the default encoding *and* which additionally do 
not include any encoding declaration.
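
Expressed as a predicate, that is just the conjunction of the three conditions (a sketch; the flags stand in for whatever a browser actually knows about the page):

    # Sketch only: the three conditions above as a single predicate.
    def affected_by_english_locale_utf8_default(has_encoding_declaration,
                                                needs_non_ascii,
                                                legacy_default_was_win1252):
        """True only for pages whose rendering would change if the English-locale
        fallback moved from Windows-1252 to UTF-8."""
        return (not has_encoding_declaration
                and needs_non_ascii
                and legacy_default_was_win1252)
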
-- 
Leif Halvard Silli


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Boris Zbarsky

On 12/5/11 12:42 PM, Leif Halvard Silli wrote:

Last I checked, some of those locales defaulted to UTF-8. (And HTML5
defines it the same.) So how is that possible?


Because the authors of pages that users of those locales tend to visit 
use UTF-8 more than anything else?



Don't users of those locales travel as much as you do?


People on average travel less than David does, yes.  In all locales.

But that's not the point.  I think you completely misunderstood his 
comments about travel and locales.  Keep reading.



What kind of trouble are you actually describing here? You are
describing a problem with using UTF-8 for *your locale*.


No.  He's describing a problem using UTF-8 to view pages that are not 
written in English.


Now what language are the non-English pages you look at written in? 
Well, it depends.  In western Europe they tend to be in languages that 
can be encoded in ISO-8859-1, so authors sometimes use that encoding 
(without even realizing it).  If you set your browser to default to 
UTF-8, those pages will be broken.


In Japan, a number of pages are authored in Shift_JIS.  Those will 
similarly be broken in a browser defaulting to UTF-8.



What is your locale?


Why does it matter?  David's default locale is almost certainly en-US, 
which defaults to ISO-8859-1 (or whatever Windows-??? encoding that 
actually means on the web) in his browser.  But again, he's changed the 
default encoding from the locale default, so the locale is irrelevant.



(Quite often it sounds as
if some see Latin-1 - or Windows-1252 as we now should say - as a
'super default' rather than a locale default. If that is the case, that
it is a super default, then we should also spec it like that! Until
further, I'll treat Latin-1 as it is specced: As a default for certain
locales.)


That's exactly what it is.


Since it is a locale problem, we need to understand which locale you
have - and/or which locale you - and other debaters - think they have.


Again, doesn't matter if you change your settings from the default.


However, you also say that your problem is not so much related to pages
written for *your* locale as it is related for pages written for users
of *other* locales. So how many times per year do Dutch, Spanish or
Norwegian - and other non-English - pages create trouble for
you, as an English locale user? I am making an assumption: Almost never.
You don't read those languages, do you?


Did you miss the travel part?  Want to look up web pages for museums, 
airports, etc in a non-English speaking country?  There's a good chance 
they're not in English!



This is also an expectation thing: If you visit a Russian page in a
legacy Cyrillic encoding, and get mojibake because your browser
defaults to Latin-1, then what does it matter to you whether your
browser defaults to Latin-1 or UTF-8? Answer: Nothing.


Yes.  So?


I think we should 'attack' the dominating locale first: The English
locale, in its different incarnations (Australian, American, UK). Thus,
we should turn things on the head: English users should start to expect
UTF-8 to be used. Because, as English users, you are more used to
'mojibake' than the rest of us are: Whenever you see it, you 'know'
that it is because it is a foreign language you are reading.


Modulo smart quotes (and recently unicode ellipsis characters).  These 
are actually pretty common in English text on the web nowadays, and have 
a tendency to be in ISO-8859-1.
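
That failure mode is easy to reproduce (a small Python illustration; the bytes below are what Windows-1252 uses for curly quotes and the ellipsis):

    # Text as saved by a Windows-1252 page that carries no encoding declaration.
    raw = '\u201csmart quotes\u201d \u2026'.encode('windows-1252')   # b'\x93smart quotes\x94 \x85'

    print(raw.decode('windows-1252'))              # correct under the legacy fallback
    print(raw.decode('utf-8', errors='replace'))   # U+FFFD where the quotes and ellipsis were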



Or, please, explain to us when and where it
is important for English language users, living in their own native
lands so to speak, that their browser defaults to Latin-1 so that
they can correctly read English language pages?


See above.


See? We would have a plan. Or what do you think?


Try it in your browser.  When I set UTF-8 as my default, there were 
broke quotation marks all over the web for me.  And I'm talking pages in 
English.


-Boris


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
Boris Zbarsky Mon Dec 5 13:49:45 PST 2011:
 On 12/5/11 12:42 PM, Leif Halvard Silli wrote:
 Last I checked, some of those locales defaulted to UTF-8. (And HTML5
 defines it the same.) So how is that possible?
 
 Because the authors of pages that users of those locales  
 tend to visit use UTF-8 more than anything else?

It is more likely that there is another reason, IMHO: They may have 
tried it, and found that it worked OK. But they of course have the same 
need for reading non-English museum and railway pages as Mozilla 
employees.

 Don't users of those locales travel as much as you do?

 I think you completely misunderstood his 
 comments about travel and locales.  Keep reading.

I'm pretty sure I haven't misunderstood very much.

 What kind of trouble are you actually describing here? You are
 describing a problem with using UTF-8 for *your locale*.
 
 No.  He's describing a problem using UTF-8 to view pages that are not 
 written in English.

And why is that a problem in those cases when it is a problem? Does he 
read those languages, anyway? Don't we expect some problems when we 
tread outside our own borders?
 
 Now what language are the non-English pages you look at written in? 
 Well, it depends.  In western Europe they tend to be in languages that 
 can be encoded in ISO-8859-1, so authors sometimes use that encoding 
 (without even realizing it).  If you set your browser to default to 
 UTF-8, those pages will be broken.
 
 In Japan, a number of pages are authored in Shift_JIS.  Those will 
 similarly be broken in a browser defaulting to UTF-8.

The solution I proposed was that English locale browsers should default 
to UTF-8. Of course, such users could then, when in Japan, get 
problems on some Japanese pages - a small nuisance, 
especially if they read Japanese.

 What is your locale?
 
 Why does it matter?  David's default locale is almost certainly en-US, 
 which defaults to ISO-8859-1 (or whatever Windows-??? encoding that 
 actually means on the web) in his browser.  But again, he's changed the 
 default encoding from the locale default, so the locale is irrelevant.

The locale is meant to predominantly be used within a physical locale. 
If he is at another physical locale, or virtually at another locale, he 
should not expect that it works out of the box unless a common 
encoding is used. Even today, if he visits Japan, he has to either 
change his browser settings *or* to rely on the pages declaring their 
encodings. So nothing would change, for him, when visiting Japan — with 
his browser or with his computer.

Yes, there would be a change, w.r.t. English quotation marks (see 
below) and w.r.t. visiting Western European language pages: For those, 
a number of pages which don't fail with Win-1252 as the default 
would start to fail. But relatively speaking, it is less important that 
non-English pages fail for the English locale.

 (Quite often it sounds as
 if some see Latin-1 - or Windows-1252 as we now should say - as a
 'super default' rather than a locale default. If that is the case, that
 it is a super default, then we should also spec it like that! Until
 further, I'll treat Latin-1 as it is specced: As a default for certain
 locales.)
 
 That's exactly what it is.

A default for certain locales? Right.

 Since it is a locale problem, we need to understand which locale you
 have - and/or which locale you - and other debaters - think they have.
 
 Again, doesn't matter if you change your settings from the default.

I don't think I have misunderstood anything.
 
 However, you also say that your problem is not so much related to pages
 written for *your* locale as it is related for pages written for users
 of *other* locales. So how many times per year do Dutch, Spanish or
 Norwegian - and other non-English - pages create trouble for
 you, as an English locale user? I am making an assumption: Almost never.
 You don't read those languages, do you?
 
 Did you miss the travel part?  Want to look up web pages for museums, 
 airports, etc in a non-English speaking country?  There's a good chance 
 they're not in English!

There is a very good chance, also, that only very few of the Web pages 
for such professional institutions would fail to declare their encoding.

 This is also an expectation thing: If you visit a Russian page in a
 legacy Cyrillic encoding, and get mojibake because your browser
 defaults to Latin-1, then what does it matter to you whether your
 browser defaults to Latin-1 or UTF-8? Answer: Nothing.
 
 Yes.  So?

So we can look away from Greek, Cyrillic, Japanese, Chinese etc. in 
this debate. The only eventual benefit for English locale users of 
keeping Win-1252 as the default is that they have slightly 
fewer problems when visiting Western European language web pages with 
their computer. (Yes, I saw that you mention smart quotes etc. below - 
so there is that reason too.) 

 I think we should 'attack' the dominating 

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Kornel Lesiński

On Fri, 02 Dec 2011 15:50:31 -, Henri Sivonen hsivo...@iki.fi wrote:


That compatibility mode already exists: It's the default mode--just
like the quirks mode is the default for pages that don't have a
doctype. You opt out of the quirks mode by saying !DOCTYPE html. You
opt out of the encoding compatibility mode by saying meta
charset=utf-8.


Could !DOCTYPE html be an opt-in to default UTF-8 encoding?

It would be nice to minimize the number of declarations a page needs to  
include.


--
regards, Kornel Lesiński


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Darin Adler
On Dec 5, 2011, at 4:10 PM, Kornel Lesiński wrote:

 Could !DOCTYPE html be an opt-in to default UTF-8 encoding?
 
 It would be nice to minimize the number of declarations a page needs to include.

I like that idea. Maybe it’s not too late.

-- Darin

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Boris Zbarsky

On 12/5/11 6:14 PM, Leif Halvard Silli wrote:

It is more likely that there is another reason, IMHO: They may have
tried it, and found that it worked OK


Where by it you mean open a text editor, type some text, and save. 
So they get whatever encoding their OS and editor defaults to.


And yes, then they find that it works ok, so they don't worry about 
encodings.



No.  He's describing a problem using UTF-8 to view pages that are not
written in English.


And why is that a problem in those cases when it is a problem?


Because the characters are wrong?


Does he read those languages, anyway?


Do you read English?  Seriously, what are you asking there, exactly?

(For the record, reading a particular page in a language is a much 
simpler task than reading the language; I can't read German, but I can 
certainly read a German subway map.)



The solution I proposed was that English locale browsers should default
to UTF-8.


I know the solution you proposed.  That solution tries to avoid the 
issues David was describing by only breaking things for people in 
English browser locales, I understand that.



Why does it matter?  David's default locale is almost certainly en-US,
which defaults to ISO-8859-1 (or whatever Windows-??? encoding that
actually means on the web) in his browser.  But again, he's changed the
default encoding from the locale default, so the locale is irrelevant.


The locale is meant to predominantly be used within a physical locale.


Yes, so?


If he is at another physical locale, or virtually at another locale, he
should not expect that it works out of the box unless a common
encoding is used.


He was responding to a suggestion that the default encoding be changed 
to UTF-8 for all locales.  Are you _really_ sure you understood the 
point of his mail?



Even today, if he visits Japan, he has to either
change his browser settings *or* to rely on the pages declaring their
encodings. So nothing would change, for him, when visiting Japan — with
his browser or with his computer.


He wasn't saying it's a problem for him per se.  He's a somewhat 
sophisticated browser user who knows how to change the encoding for a 
particular page.


What he was saying is that there are lots of pages out there that aren't 
encoded in UTF-8 and rely on locale fallbacks to particular encodings, 
and that he's run into them a bunch while traveling in particular, so 
they were not pages in English.  So far, you and he seem to agree.



Yes, there would be a change, w.r.t. English quotation marks (see
below) and w.r.t. visiting Western European language pages: For those,
a number of pages which don't fail with Win-1252 as the default
would start to fail. But relatively speaking, it is less important that
non-English pages fail for the English locale.


No one is worried about that, particularly.


There is a very good chance, also, that only very few of the Web pages
for such professional institutions would fail to declare their encoding.


You'd be surprised.


Modulo smart quotes (and recently unicode ellipsis characters).  These
are actually pretty common in English text on the web nowadays, and have
a tendency to be in ISO-8859-1.


If we change the default, they will start to tend to be in UTF-8.


Not unless we change the authoring tools.  Half the time these things 
are just directly exported from a word processor.



OK: Quotation marks. However, in 'old web pages' you also find
much more use of HTML entities (such as &ldquo;) than you find today.
We should take advantage of that, no?


I have no idea what you're trying to say,


When you mention quotation marks, then you mention a real locale-related
issue. And maybe the Euro sign too?


Not an issue for me personally, but it could be for some, yes.


Nevertheless, the problem is smallest for languages that primarily limit 
their alphabet to those letters that are present in the American Standard 
Code for Information Interchange format.


Sure.  It may still be too big.


It would be logical, thus, to start the switch to
UTF-8 for those locales


If we start at all.


Perhaps we need to have a project to measure these problems, instead of
all these anecdotes?


Sure.  More data is always better than anecdotes.

-Boris


Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
On 12/5/11 6:14 PM, Leif Halvard Silli wrote:
 It is more likely that there is another reason, IMHO: They may have
 tried it, and found that it worked OK
 
 Where by it you mean open a text editor, type some text, and save. 
 So they get whatever encoding their OS and editor defaults to.

If that is all they tested, then I'd say they did not test enough.

 And yes, then they find that it works ok, so they don't worry about 
 encodings.

Ditto.
 
 No.  He's describing a problem using UTF-8 to view pages that are not
 written in English.

 And why is that a problem in those cases when it is a problem?
 
 Because the characters are wrong?

But the characters will be wrong many more times than exactly those 
times when he tries to read a Web page in a Western European 
language that is not declared as Win-1252. Do English locale users 
have particular expectations with regard to exactly those Web pages? 
What about Polish Web pages etc.? English locale users are a very 
multiethnic lot.

 Does he read those languages, anyway?
 
 Do you read English?  Seriously, what are you asking there, exactly?

Because if it is an issue, then it is about expectations for exactly 
those pages. (Plus the quote problem, of course.)

 (For the record, reading a particular page in a language is a much 
 simpler task than reading the language; I can't read German, but I can 
 certainly read a German subway map.)

Or Polish subway map - which doesn't default to said encoding.

 The solution I proposed was that English locale browsers should default
 to UTF-8.
 
 I know the solution you proposed.  That solution tries to avoid the 
 issues David was describing by only breaking things for people in 
 English browser locales, I understand that.

That characterization is only true with regard to the quote problem. 
That German pages break would not be any more important than the 
fact that Polish pages would. For that matter: it happens that UTF-8 
pages break as well.

I only suggest it as a first step, so to speak. Or rather - since some 
locales apparently already default to UTF-8 - as a next step. 
Thereafter, more locales would be expected to follow suit - as the 
development of each locale permits.

 Why does it matter?  David's default locale is almost certainly en-US,
 which defaults to ISO-8859-1 (or whatever Windows-??? encoding that
 actually means on the web) in his browser.  But again, he's changed the
 default encoding from the locale default, so the locale is irrelevant.

 The locale is meant to predominantly be used within a physical locale.
 
 Yes, so?

So then we have a set of expectations for the language of that locale. 
If we look at how the locale settings handle other languages, then we 
are outside the issue that the locale-specific encodings are supposed 
to handle.

 If he is at another physical locale, or virtually at another locale, he
 should not expect that it works out of the box unless a common
 encoding is used.
 
 He was responding to a suggestion that the default encoding be changed 
 to UTF-8 for all locales.  Are you _really_ sure you understood the 
 point of his mail?

I said I agreed with him that Faruk's solution was not good. However, I 
would not be against treating DOCTYPE html as a 'default to UTF-8' 
declaration, as suggested by some - if it were possible to agree about 
that. Then we could keep things as they are, except for the HTML5 
DOCTYPE. I guess the HTML5 doctype would become 'the default before the 
default': If everything else fails, then UTF-8 if the DOCTYPE is 
!DOCTYPE html, or else, the locale default.

It sounded like Darin Adler thinks it possible. How about you?
 
 Even today, if he visits Japan, he has to either
 change his browser settings *or* to rely on the pages declaring their
 encodings. So nothing would change, for him, when visiting Japan — with
 his browser or with his computer.
 
 He wasn't saying it's a problem for him per se.  He's a somewhat 
 sophisticated browser user who knows how to change the encoding for a 
 particular page.

If we are talking about an English locale user visiting Japan, then I 
doubt a change in the default encoding would matter - Win-1252 as 
the default would be wrong anyway.

 What he was saying is that there are lots of pages out there that aren't 
 encoded in UTF-8 and rely on locale fallbacks to particular encodings, 
 and that he's run into them a bunch while traveling in particular, so 
 they were not pages in English.  So far, you and he seem to agree.

So far we agree, yes.
 
 Yes, there would be a change, w.r.t. Enlgish quotation marks (see
 below) and w.r.tg. visiting Western European languages pages: For those
 a number of pages which doesn't fail with Win-1252 as the default,
 would start to fail. But relatively speaking, it is less important that
 non-English pages fail for the English locale.
 
 No one is worried about that, particularly.

You spoke about visiting German pages above - sounded like you worried, 
but 

Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Boris Zbarsky

On 12/5/11 9:55 PM, Leif Halvard Silli wrote:

If that is all they tested, then I'd say they did not test enough.


That's normal for the web.


(For the record, reading a particular page in a language is a much
simpler task than reading the language; I can't read German, but I can
certainly read a German subway map.)


Or Polish subway map - which doesn't default to said encoding.


Indeed.  I don't think anyone thinks the existing situation is all fine 
or anything.



I said I agreed with him that Faruk's solution was not good. However, I
would not be against treating DOCTYPE html as a 'default to UTF-8'
declaration


This might work, if there hasn't been too much cargo-culting yet.  Data 
urgently needed!



Not unless we change the authoring tools.  Half the time these things
are just directly exported from a word processor.


Please educate me. I'm perhaps 'handicapped' in that regard: I haven't
used MS Word on a regular basis since MS Word 5.1 for Mac. Also, if
export means copy and paste


It can mean that, or save as HTML followed by copy and paste.


then on the Mac, everything gets
converted via the clipboard


On Mac, the default OS encoding is UTF-8 last I checked.  That's 
decidedly not the case on Windows.



OK: Quotation marks. However, in 'old web pages' you also find
much more use of HTML entities (such as &ldquo;) than you find today.
We should take advantage of that, no?


I have no idea what you're trying to say,


Sorry. What I meant was that character entities are encoding
independent.


Yes.


And that lots of people - and authoring tools - have
inserted non-ASCII letters and characters as character entities,


Sure.  And lots have inserted them directly.


At any rate: A page which uses
character entities for non-ascii would render the same regardless of
encoding, hence a switch to UTF-8 would not matter for those.


Sure.  We're not worried about such pages here.
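
That encoding-independence is easy to check (a Python sketch; html.unescape merely stands in for the browser's entity handling):

    from html import unescape

    # An ASCII-only byte stream that expresses non-ASCII text via entities.
    markup = b'<p>&ldquo;quoted&rdquo; &hellip;</p>'

    # The bytes decode identically under either fallback encoding,
    # so the rendered text cannot depend on the default.
    assert markup.decode('windows-1252') == markup.decode('utf-8')
    print(unescape(markup.decode('utf-8')))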

-Boris



Re: [whatwg] Default encoding to UTF-8?

2011-12-05 Thread Leif Halvard Silli
Boris Zbarsky Mon Dec 5 19:18:10 PST 2011:
 On 12/5/11 9:55 PM, Leif Halvard Silli wrote:

 I said I agreed with him that Faruk's solution was not good. However, I
 would not be against treating DOCTYPE html as a 'default to UTF-8'
 declaration
 
 This might work, if there hasn't been too much cargo-culting yet.  Data 
 urgently needed!

Yeah, it would be a pity if it had already become a widespread 
cargo-cult to - all at once - use the HTML5 doctype without using UTF-8 
*and* without using some encoding declaration *and* thus effectively 
relying on the default locale encoding ... Who has a data corpus? 
Henri, as Validator.nu developer?

This change would involve adding one more step to the HTML5 parser's 
encoding sniffing algorithm. [1] The question then is when, upon seeing 
the HTML5 doctype, the default to UTF-8 ought to happen, in order to be 
useful. It seems it would have to happen after the processing of the 
explicit metadata (steps 1 to 5) but before the last 3 steps - steps 6, 
7 and 8:

Step 6: 'if the user agent has information on the likely encoding'
Step 7: UA 'may attempt to autodetect the character encoding'
Step 8: 'implementation-defined or user-specified default'

The role of the HTML5 DOCTYPE, encoding-wise, would then be to ensure 
that steps 6 to 8 do not happen. 

[1] http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm
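
A sketch of where such a step could sit (Python, with everything else reduced to stand-ins for the spec's real algorithm; the function and parameter names are made up):

    # Sketch only: a hypothetical variant of the sniffing algorithm in which the
    # HTML5 doctype, absent explicit metadata, short-circuits steps 6 to 8.
    def sniff_encoding(markup,
                       explicit_encoding=None,          # outcome of steps 1-5
                       likely_encoding=None,            # step 6
                       autodetected_encoding=None,      # step 7
                       locale_default='windows-1252'):  # step 8
        if explicit_encoding:
            return explicit_encoding
        if markup.lstrip().lower().startswith('<!doctype html>'):
            return 'utf-8'                              # the proposed extra step
        return likely_encoding or autodetected_encoding or locale_default

    print(sniff_encoding('<!DOCTYPE html><title>t</title>'))      # utf-8
    print(sniff_encoding('<html><title>legacy</title></html>'))   # windows-1252
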
-- 
Leif H Silli


Re: [whatwg] Default encoding to UTF-8?

2011-12-04 Thread Henri Sivonen
On Fri, Dec 2, 2011 at 6:29 PM, Glenn Maynard gl...@zewt.org wrote:
 On Fri, Dec 2, 2011 at 10:46 AM, Henri Sivonen hsivo...@iki.fi wrote:

 Regarding your (and 16) remark, considering my personal happiness at
 work, I'd prioritize the eradication of UTF-16 as an interchange
 encoding much higher than eradicating ASCII-based non-UTF-8 encodings
 that all major browsers support. I think suggesting a solution to the
 encoding problem while implying that UTF-16 is not a problem isn't
 particularly appropriate. :-)
...
 I don't think I'd call it a bigger problem, though, since it's comparatively
 (even vanishingly) rare, whereas untagged legacy encodings are a widespread
 problem that gets worse every day we can't think of a way to curtail it.

From an implementation perspective, UTF-16 has its own class of bugs that
are unlike other encoding-related bugs, and fixing those bugs is
particularly annoying because you know that UTF-16 is so rare that
the fix has little actual utility.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Default encoding to UTF-8?

2011-12-04 Thread Glenn Maynard
On Mon, Dec 5, 2011 at 1:30 AM, Henri Sivonen hsivo...@iki.fi wrote:

 From an implementation perspective, UTF-16 has its own class of bugs that
 are unlike other encoding-related bugs, and fixing those bugs is
 particularly annoying because you know that UTF-16 is so rare that
 the fix has little actual utility.


There are lots of things like that on the platform, though, and this one
doesn't really get worse over time.  More and more content with untagged
legacy encodings accumulates every day, regularly causing user-visible
problems, which is why I'd call it a much bigger issue.

-- 
Glenn Maynard


Re: [whatwg] Default encoding to UTF-8?

2011-12-02 Thread Michael A. Puls II
On Wed, 30 Nov 2011 21:29:31 -0500, L. David Baron dba...@dbaron.org  
wrote:



I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago
(by changing the intl.charset.default preference)


Just to add, in Opera, you can go to Ctrl + F12 - General tab - Language  
section - Details and set "Encoding to assume for pages lacking  
specification" to utf-8. Or, do it via  
opera:config#Fallback%20HTML%20Encoding.


I tried this years ago, but don't remember if it caused any problems on  
any web sites I visited. But, I quit setting it to utf-8 because I'd  
forget about it and it affected some web page encoding test cases where  
others would get different results on the tests because they had it set at  
the default (don't remember the details of the tests).


--
Michael


Re: [whatwg] Default encoding to UTF-8?

2011-12-02 Thread Henri Sivonen
On Thu, Dec 1, 2011 at 1:28 AM, Faruk Ates faruka...@me.com wrote:
 My understanding is that all browsers* default to Western Latin (ISO-8859-1) 
 encoding by default (for Western-world downloads/OSes) due to legacy content 
 on the web.

As has already been pointed out, the default varies by locale.

 But how relevant is that still today?

It's relevant for supporting the long tail of existing content. The
sad part is that the mechanisms that allow existing legacy content to
work within each locale silo also make it possible for ill-informed
or uncaring authors to develop more locale-siloed content (i.e.
content that doesn't declare the encoding and, therefore, only works
when the user's fallback encoding is the same as the author's).

 I'm wondering if it might not be good to start encouraging defaulting to 
 UTF-8, and only fallback to Western Latin if it is detected that the content 
 is very old / served by old infrastructure or servers, etc. And of course if 
 the content is served with an explicit encoding of Western Latin.

I think this would be a very bad idea. It would make debugging hard.
Moreover, it would be the wrong heuristic, because well-maintained
server infrastructure can host a lot of legacy content. Consider any
shared hosting situation where the administrator of the server
software isn't the content creator.

 We like to think that “every web developer is surely building things in UTF-8 
 nowadays” but this is far from true. I still frequently break websites and 
 webapps simply by entering my name (Faruk Ateş).

For things to work, the server-side component needs to deal with what
gets sent to it. ASCII-oriented authors could still mishandle all
non-ASCII even if Web browsers forced them to deal with UTF-8 by
sending them UTF-8.

Furthermore, your proposed solution wouldn't work for legacy software
that correctly declares an encoding but declares a non-UTF-8 one.

Sadly, getting sites to deal with your name properly requires the
developer of each site to get a clue. :-( Just sending form
submissions in UTF-8 isn't enough if the recipient can't deal. Compare
with http://krijnhoetmer.nl/irc-logs/whatwg/20110906#l-392
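
As an illustration of that last point (a sketch): a back end that hard-codes a legacy decode mangles a perfectly correct UTF-8 submission anyway.

    name = 'Faruk Ate\u015f'                    # "Faruk Ateş"
    submitted = name.encode('utf-8')            # what a UTF-8 form submission sends

    # A back end that blindly decodes as Windows-1252 still mangles the name,
    # even though the browser did everything right.
    print(submitted.decode('windows-1252'))     # mojibake: the two-byte UTF-8 sequence splits
    print(submitted.decode('utf-8'))            # Faruk Ateş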

 Yes, I understand that that particular issue is something we ought to fix 
 through evangelism, but I think that WHATWG/browser vendors can help with 
 this while at the same time (rightly, smartly) making the case that the web 
 of tomorrow should be a UTF-8 (and 16) based one, not a smorgasbord of 
 different encodings.

Anne has worked on speccing what exactly the smorgasbord should be.
See http://wiki.whatwg.org/wiki/Web_Encodings . I think it's not
realistic to drop encodings that are on the list of encodings you see
in the encoding menu on http://validator.nu/?charset However, I think
browsers should drop support for encodings that aren't already
supported by all the major browsers, because such encodings only serve
to enable browser-specific content and encoding proliferation.

Regarding your (and 16) remark, considering my personal happiness at
work, I'd prioritize the eradication of UTF-16 as an interchange
encoding much higher than eradicating ASCII-based non-UTF-8 encodings
that all major browsers support. I think suggesting a solution to the
encoding problem while implying that UTF-16 is not a problem isn't
particularly appropriate. :-)

 So hence my question whether any vendor has done any recent research in this. 
 Mobile browsers seem to have followed desktop browsers in this; perhaps this 
 topic was tested and researched in recent times as part of that, but I 
 couldn't find any such data. The only real relevant thread of discussion 
 around UTF-8 as a default was this one about Web Workers:
 http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-September/023197.html

 …which basically suggested that everyone is hugely in favor of UTF-8 and 
 making it a default wherever possible.

 So how 'bout it?

I think in order to comply with the Support Existing Content design
principle (even if it unfortunately means that support is siloed by
locale) and in order to make plans that are game theoretically
reasonable (not taking steps that make users migrate to browsers that
haven't taken the steps), we shouldn't change the fallback
encodings from what the HTML5 spec says when it comes to loading
text/html or text/plain content into a browsing context.

 What's going in this area, if anything?

There's the effort to specify a set of encodings and their aliases for
browsers to support. That's moving slowly, since Anne has other more
important specs to work on.

Other than that, there have been efforts to limit new features to
UTF-8 only (consider scripts in Workers and App Cache manifests) and
efforts to make new features not vary by locale-dependent defaults
(consider HTML in XHR). Both these efforts have faced criticism,
unfortunately.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/


Re: [whatwg] Default encoding to UTF-8?

2011-12-02 Thread Henri Sivonen
On Thu, Dec 1, 2011 at 8:29 PM, Brett Zamir bret...@yahoo.com wrote:
 How about a Compatibility Mode for the older non-UTF-8 character set
 approach, specific to page?

That compatibility mode already exists: It's the default mode--just
like the quirks mode is the default for pages that don't have a
doctype. You opt out of the quirks mode by saying !DOCTYPE html. You
opt out of the encoding compatibility mode by saying meta
charset=utf-8.

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/



Re: [whatwg] Default encoding to UTF-8?

2011-12-02 Thread Glenn Maynard
On Fri, Dec 2, 2011 at 10:46 AM, Henri Sivonen hsivo...@iki.fi wrote:

 Regarding your (and 16) remark, considering my personal happiness at
 work, I'd prioritize the eradication of UTF-16 as an interchange
 encoding much higher than eradicating ASCII-based non-UTF-8 encodings
 that all major browsers support. I think suggesting a solution to the
 encoding problem while implying that UTF-16 is not a problem isn't
 particularly appropriate. :-)


UTF-16 is definitely terrible for interchange (it's terrible for internal
use, too, but we're stuck with that), and I'm all for anything that
prevents its proliferation.

I don't think I'd call it a bigger problem, though, since it's
comparatively (even vanishingly) rare, whereas untagged legacy encodings are
a widespread problem that gets worse every day we can't think of a way to
curtail it.

I don't have any new ideas for doing that, either, though.

I think in order to comply with the Support Existing Content design
 principle (even if it unfortunately means that support is siloed by
 locale) and in order to make plans that are game theoretically
 reasonable (not taking steps that make users migrate to browsers that
 haven't taken the steps), we shouldn't change the fallback
 encodings from what the HTML5 spec says when it comes to loading
 text/html or text/plain content into a browsing context.


And no browser vendor would ever do this, no matter what the spec says,
since nobody's willing to break massive swaths of existing content.

-- 
Glenn Maynard


Re: [whatwg] Default encoding to UTF-8?

2011-12-01 Thread Sergiusz Wolicki
I have read section 4.2.5.5 of the WHATWG HTML spec and I think it is
sufficient.  It requires that any non-US-ASCII document has an explicit
character encoding declaration. It also recommends UTF-8 for all new
documents and for authoring tools' default encoding.  Therefore, any
document conforming to HTML5 should not pose any problem in this area.

The default encoding issue is therefore for old stuff.  But I have seen a
lot of pages, in browsers and in mail, that were tagged with one encoding
and encoded in another.  Hence, documents without a charset declaration are
only one of the reasons for the garbage we see. Therefore, I see no point in
trying to fix anything in browsers by changing the ancient defaults
(risking compatibility issues). Energy should go into filing bugs against
misbehaving authoring tools and into adding proper recommendations and
education in HTML guidelines and tutorials.


Thanks,
Sergiusz


On Thu, Dec 1, 2011 at 7:00 AM, L. David Baron dba...@dbaron.org wrote:

 On Thursday 2011-12-01 14:37 +0900, Mark Callow wrote:
  On 01/12/2011 11:29, L. David Baron wrote:
   The default varies by localization (and within that potentially by
   platform), and unfortunately that variation does matter.
  In my experience this is what causes most of the breakage. It leads
  people to create pages that do not specify the charset encoding. The
  page works fine in the creator's locale but shows mojibake (garbage
  characters) for anyone in a different locale.
 
  If the default was ASCII everywhere then all authors would see mojibake,
  unless it really was an ASCII-only page, which would force them to set
  the charset encoding correctly.

 Sure, if the default were consistent everywhere we'd be fine.  If we
 have a choice in what that default is, UTF-8 is probably a good
 choice unless there's some advantage to another one.  But nobody's
 figured out how to get from here to there.

 (I think this is legacy from the pre-Unicode days, when the browser
 simply displayed Web pages using to the system character set, which
 led to a legacy of incompatible Web pages in different parts of the
 world.)

 -David

 --
 L. David Baron   http://dbaron.org/
 Mozilla          http://www.mozilla.org/



Re: [whatwg] Default encoding to UTF-8?

2011-12-01 Thread Brett Zamir

On 12/1/2011 2:00 PM, L. David Baron wrote:

On Thursday 2011-12-01 14:37 +0900, Mark Callow wrote:

On 01/12/2011 11:29, L. David Baron wrote:

The default varies by localization (and within that potentially by
platform), and unfortunately that variation does matter.

In my experience this is what causes most of the breakage. It leads
people to create pages that do not specify the charset encoding. The
page works fine in the creator's locale but shows mojibake (garbage
characters) for anyone in a different locale.

If the default was ASCII everywhere then all authors would see mojibake,
unless it really was an ASCII-only page, which would force them to set
the charset encoding correctly.

Sure, if the default were consistent everywhere we'd be fine.  If we
have a choice in what that default is, UTF-8 is probably a good
choice unless there's some advantage to another one.  But nobody's
figured out how to get from here to there.
How about a Compatibility Mode for the older non-UTF-8 character set 
approach, specific to page?  I wholeheartedly agree that something 
should be done here, preventing yet more content from piling up in 
outdated ways without any consequences. (Same with email clients too, I 
would hope.)


Brett



Re: [whatwg] Default encoding to UTF-8?

2011-11-30 Thread Jukka K. Korpela

2011-12-01 1:28, Faruk Ates wrote:


My understanding is that all browsers* default to Western Latin (ISO-8859-1)
 encoding by default (for Western-world downloads/OSes) due to legacy 
content on the web.


Browsers default to various encodings, often windows-1252 (rather than 
ISO-8859-1). They may also investigate the actual data and make a guess 
based on it.



I'm wondering if it might not be good to start encouraging defaulting to UTF-8,


It would not. There’s no reason to recommend any particular defaulting, 
especially not something that deviates from past practices.


It might be argued that browsers should do better error detection and 
reporting, so that they inform the user e.g. if the document’s encoding 
has not been declared at all and it cannot be inferred fairly reliably 
(e.g., from BOM). But I’m afraid the general feeling is that browsers 
should avoid warning users, as that tends to contradict authors’ 
purposes – and, in fact, mostly things that are serious problems in 
principle aren’t that serious in practice.



We like to think that “every web developer is surely building things in UTF-8 
nowadays”

 but this is far from true.

There’s a large amount of pages declared as UTF-8 but containing Ascii 
only, as well as pages mislabeled as UTF-8 but containing e.g. ISO-8859-1.



I still frequently break websites and webapps simply by entering my name (Faruk 
Ateş).


That’s because the server-side software (and possibly client-side 
software) cannot handle the letter “ş”. It would not help if the page 
were interpreted as UTF-8. If the author knows that a server-side form


Yucca


Re: [whatwg] Default encoding to UTF-8?

2011-11-30 Thread L. David Baron
On Wednesday 2011-11-30 15:28 -0800, Faruk Ates wrote:
 My understanding is that all browsers* default to Western Latin
 (ISO-8859-1) encoding by default (for Western-world
 downloads/OSes) due to legacy content on the web. But how relevant
 is that still today? Has any browser done any recent research into
 the need for this?

The default varies by localization (and within that potentially by
platform), and unfortunately that variation does matter.  You can
see Firefox's defaults here:
http://mxr.mozilla.org/l10n-mozilla-beta/search?string=intl.charset.default
(The localization and platform are part of the filename.)

I changed my Firefox from the ISO-8859-1 default to UTF-8 years ago
(by changing the intl.charset.default preference), and I do see a
decent amount of broken content as a result (maybe I encounter a new
broken page once a week? -- though substantially more often if I'm
looking at non-English pages because of travel).

 I'm wondering if it might not be good to start encouraging
 defaulting to UTF-8, and only fallback to Western Latin if it is
 detected that the content is very old / served by old
 infrastructure or servers, etc. And of course if the content is
 served with an explicit encoding of Western Latin.

The more complex the rules, the harder they are for authors to
understand / debug.  I wouldn't want to create rules like those.

I would, however, like to see movement towards defaulting to UTF-8:
the current situation makes the Web less world-wide because pages
that work for one user don't work for another.

I'm just not quite sure how to get from here to there, though, since
such changes are likely to make users experience broken content.

-David

-- 
L. David Baron   http://dbaron.org/
Mozilla          http://www.mozilla.org/


Re: [whatwg] Default encoding to UTF-8?

2011-11-30 Thread Mark Callow


On 01/12/2011 11:29, L. David Baron wrote:
 The default varies by localization (and within that potentially by
 platform), and unfortunately that variation does matter.
In my experience this is what causes most of the breakage. It leads
people to create pages that do not specify the charset encoding. The
page works fine in the creator's locale but shows mojibake (garbage
characters) for anyone in a different locale.

If the default was ASCII everywhere then all authors would see mojibake,
unless it really was an ASCII-only page, which would force them to set
the charset encoding correctly.

Regards

-Mark


Re: [whatwg] Default encoding to UTF-8?

2011-11-30 Thread L. David Baron
On Thursday 2011-12-01 14:37 +0900, Mark Callow wrote:
 On 01/12/2011 11:29, L. David Baron wrote:
  The default varies by localization (and within that potentially by
  platform), and unfortunately that variation does matter.
 In my experience this is what causes most of the breakage. It leads
 people to create pages that do not specify the charset encoding. The
 page works fine in the creator's locale but shows mojibake (garbage
 characters) for anyone in a different locale.
 
 If the default was ASCII everywhere then all authors would see mojibake,
 unless it really was an ASCII-only page, which would force them to set
 the charset encoding correctly.

Sure, if the default were consistent everywhere we'd be fine.  If we
have a choice in what that default is, UTF-8 is probably a good
choice unless there's some advantage to another one.  But nobody's
figured out how to get from here to there.

(I think this is legacy from the pre-Unicode days, when the browser
simply displayed Web pages using the system character set, which
led to a legacy of incompatible Web pages in different parts of the
world.)

-David

-- 
L. David Baron   http://dbaron.org/
Mozilla          http://www.mozilla.org/