Re: Detection of unlabeled UTF-8
Adam Roach wrote:
> when you look at that document, tell me what you think the parenthetical phrase after the author's name is supposed to look like -- because I can guarantee that Firefox isn't doing the right thing here.

In my case it does, and displays: Хизер Фланаган. I have the universal charset detector activated.

The simple thing Firefox could do is to interpret "text/plain" and "text/plain;charset=us-ascii" as UTF-8 by default. In the first case, handling of non-ASCII characters is undefined, and in the second they are illegal. So in these two cases, UTF-8 cannot break anything that is supposed to work.
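A minimal sketch of the fallback the poster is describing, in Python; the function name and the dispatch logic are invented for illustration, not anything Firefox does. The underlying fact is that US-ASCII is a strict subset of UTF-8, so the promotion is lossless for content that actually conforms to its label:

```python
def effective_charset(content_type: str) -> str | None:
    # Hypothetical fallback: every valid US-ASCII byte stream is also
    # valid UTF-8, so promoting these two labels to UTF-8 cannot change
    # the rendering of any page that conforms to its declared label.
    ct = content_type.lower().replace(" ", "")
    if ct in ("text/plain", "text/plain;charset=us-ascii"):
        return "utf-8"
    if "charset=" in ct:
        return ct.split("charset=", 1)[1]  # honor an explicit label
    return None  # fall through to the usual fallback-encoding path
```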
Re: Detection of unlabeled UTF-8
And then you get sites that send ISO-8859-1 but the server is configured to send UTF-8 in the headers, e.g. http://darwinawards.com/darwin/darwin1999-38.html

-- Warning: May contain traces of nuts.
Re: Detection of unlabeled UTF-8
On 9/9/13 02:31, Henri Sivonen wrote:
> We don't have telemetry for the question "How often are pages that are not labeled as UTF-8, UTF-16 or anything that maps to their replacement encoding according to the Encoding Standard and that contain non-ASCII bytes in fact valid UTF-8?" How rare would the mislabeled UTF-8 case need to be for you to consider the UI that you're proposing not worth it?

I'd think it would depend somewhat on the severity of the misencoding. For example, interpreting a page of UTF-8 as Windows-1252 isn't generally going to completely ruin a page with the occasional accented Latin character, although it will certainly be an obvious defect. I'd be happy to leave the situation be if this happened to fewer than 1% of users over a six-week period.

On the other hand, misrendering a page of UTF-8 that consists predominantly of a non-Latin script is pretty catastrophic, and is going to tend to happen to the same subset of users over and over again. For that situation, I think I'd like to see fewer than 0.1% of users of builds localized for languages written in non-Latin scripts impacted over a six-week period before I was happy leaving things as-is.

> However, we do have telemetry for the percentage of Firefox sessions in which the current character encoding override UI has been used at least once. See https://bugzilla.mozilla.org/show_bug.cgi?id=906032 for the results broken down by desktop versus Android and then by locale.

I don't think measuring the behavior of those few people who know about this feature is particularly relevant. The status quo works for them, by definition. I'm far more concerned about those users who get garbled pages and don't have the knowledge to do anything about it.

> I would accept a (performance-conscious) patch for gathering telemetry for the UTF-8 question in the HTML parser. However, I'm not volunteering to write one myself immediately, because I have bugs on my todo list that have been caused by previous attempts of Gecko developers to be well-intentioned about DWIM and UI around character encodings. Gotta fix those first.

Great. I'll see if I can wedge in some time to put one together (although I'm similarly swamped, so I don't have a good timeframe for this). If anyone else has time to roll one out, that would be even better.

> Even non-automatic correction means authors can take the attitude that getting the encoding wrong is no big deal since the fix is a click away for the user.

I'll repeat that it's not our job to police the web. I'm firmly of the opinion that those developers who don't care about doing things right won't do them right no matter how big a stick you personally choose to beat them with. On the other hand, I'm quite worried about collateral damage to our users in your crusade to control publishers.

Give the publishers the tools to understand their errors, and the users the tools to use the web the way they want to use it. Those publishers who aren't bad actors will correct their own behavior -- those who _are_ bad actors aren't going to behave anyway. There's no point getting authoritarian about it and making the web a less accessible place as a consequence.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On Fri, Sep 6, 2013 at 6:17 PM, Adam Roach wrote:
> Sure. It's a much trickier problem (and, in any case, the UI is necessarily more intrusive than what I'm suggesting). There's no good way to explain the nuanced implications of security decisions in a way that is both accessible to a lay user and concise enough to hold the average user's attention.

Yes, the decisions that the user is asked to make in the case of HTTPS deployment errors are more difficult than the decision whether to reload the page as UTF-8.

(Just for completeness, I should mention that what you're proposing could be security-sensitive without some further tweaks. For starters, if a page has been labeled as UTF-16 or anything that maps to the replacement encoding according to the Encoding Standard, we should not let the user reload the page as UTF-8. When I say "labeled as UTF-16", I mean labels that are supposed to take effect as UTF-16 per WHATWG HTML. I don't mean the sort of bogus UTF-16 labels that actually are treated as UTF-8 labels by WHATWG HTML.)

> To the first point: the increase in complexity is fairly minimal for a substantial gain in usability.

How substantial the gain in usability would be is not known without exact telemetry, but see below. As for complexity, as the person who has been working with the relevant code the most in the last couple of years, I think we should try to get rid of the code for implementing encoding overrides by the user instead of coming up with new ways to trigger that code. Thanks to e.g. the mistake of introducing UTF-16 as an interchange encoding to the Web, that code has needed security fixes.

> Absent hard statistics, I suspect we will disagree about how "fringe" this particular exception is. Suffice it to say that I have personally encountered it as a problem as recently as last week. If you think we need to move beyond anecdotes and personal experience, let's go ahead and add telemetry to find out how often this arises in the field.

We don't have telemetry for the question "How often are pages that are not labeled as UTF-8, UTF-16 or anything that maps to their replacement encoding according to the Encoding Standard and that contain non-ASCII bytes in fact valid UTF-8?" How rare would the mislabeled UTF-8 case need to be for you to consider the UI that you're proposing not worth it?

However, we do have telemetry for the percentage of Firefox sessions in which the current character encoding override UI has been used at least once. See https://bugzilla.mozilla.org/show_bug.cgi?id=906032 for the results broken down by desktop versus Android and then by locale.

One could speculate about the answer to the UTF-8 question relative to this telemetry data both ways: since the general character encoding override usage includes cases where the encoding being switched to is not UTF-8, one could expect the UTF-8 case to be even more fringe than what these telemetry results show. On the other hand, these telemetry results show only cases where the user is aware of the existence of the character encoding override UI and bothers to use it, so one could argue that the UTF-8 case could actually be more common.

I would accept a (performance-conscious) patch for gathering telemetry for the UTF-8 question in the HTML parser. However, I'm not volunteering to write one myself immediately, because I have bugs on my todo list that have been caused by previous attempts of Gecko developers to be well-intentioned about DWIM and UI around character encodings. Gotta fix those first.

> Your second point is an argument against automatic correction. Don't get me wrong: I think automatic correction leads to innocent publisher mistakes that make things worse over the long term. I absolutely agree that doing so trades short-term gain for long-term damage. But I'm not arguing for automatic correction.

Even non-automatic correction means authors can take the attitude that getting the encoding wrong is no big deal since the fix is a click away for the user. But how will that UI work in non-browser apps that load Web content on B2G, etc.?

On Fri, Sep 6, 2013 at 6:45 PM, Robert Kaiser wrote:
> Hmm, do we have to treat the whole document as a consistent charset?

The practical answer is yes.

> Could we instead, if we don't know the charset, look at every rendered-as-text node/attribute in the DOM tree and run some kind of charset detection on it? May be a dumb idea but might avoid the problem on the parsing level.

And then we'd have at least 34 problems (if my quick count of legacy encodings was correct). On a more serious note, though, it's a bad idea to develop complex solutions to problems that are actually relatively rare on the Web these days, and it's even worse to go deeper into DWIM when experience shows that DWIM in this area is a big part of the reason we have this mess.

On Fri, Sep 6, 2013 at 7:36 PM, Neil Harris wrote:
> http://w3techs.com/techn
Re: Detection of unlabeled UTF-8
On 06/09/13 18:28, Boris Zbarsky wrote:
> On 9/6/13 1:11 PM, Neil Harris wrote:
>> Presumably most of that XHTML is being generated by automated tools
> Presumably most of that "XHTML" are tag-soup pages which claim to have an XHTML doctype. The chance of them actually being valid XHTML is slim to none (though maybe higher than the chance of them being served as application/xhtml+xml).
> -Boris

Indeed. I suspect in quite a lot of these cases the reason for using an XHTML doctype was that XHTML's got an "X", and anything with an "X" added has _got_ to be better.

-- XNeil
Re: Detection of unlabeled UTF-8
On Friday, September 6, 2013 at 5:36 PM, Neil Harris wrote:
> On 06/09/13 16:34, Gervase Markham wrote:
>> Data! Sounds like a plan.
>> Or we could ask our friends at Google or some other search engine to run a version of our detector over their index and see how often it says "UTF-8" when our normal algorithm would say something else.
>> Gerv
> This website has an interesting, and apparently up-to-date set of statistics:

Wait a minute, they also claim that XHTML is used on 54.9% of sites? I'm skeptical of their methodology. See: http://w3techs.com/technologies/overview/markup_language/all
Re: Detection of unlabeled UTF-8
On 06/09/13 17:48, Marcos Caceres wrote:
> Wait a minute, they also claim that XHTML is used on 54.9% of sites? I'm skeptical of their methodology. See: http://w3techs.com/technologies/overview/markup_language/all

That surprised me on reading it, too. However, it doesn't seem too far from this site's estimate for the same thing: http://try.powermapper.com/Stats/HtmlVersions

Presumably most of that XHTML is being generated by automated tools whose authors assumed that XHTML represented the "latest and greatest" HTML spec, and who now seem to be in the process of transitioning to the new latest-and-greatest, HTML 5, as shown by the rising tide of HTML 5 in the above graph.

-- Neil
Re: Detection of unlabeled UTF-8
Henri Sivonen wrote:
> Considering what Aryeh said earlier in this thread, do you have a suggestion how to do that so that [...]

Hmm, do we have to treat the whole document as a consistent charset? Could we instead, if we don't know the charset, look at every rendered-as-text node/attribute in the DOM tree and run some kind of charset detection on it?

May be a dumb idea but might avoid the problem on the parsing level.

Robert Kaiser
Re: Detection of unlabeled UTF-8
On 9/6/13 1:11 PM, Neil Harris wrote:
> Presumably most of that XHTML is being generated by automated tools

Presumably most of that "XHTML" are tag-soup pages which claim to have an XHTML doctype. The chance of them actually being valid XHTML is slim to none (though maybe higher than the chance of them being served as application/xhtml+xml).

-Boris
Re: Detection of unlabeled UTF-8
On 06/09/13 16:34, Gervase Markham wrote:
> Data! Sounds like a plan.
> Or we could ask our friends at Google or some other search engine to run a version of our detector over their index and see how often it says "UTF-8" when our normal algorithm would say something else.
> Gerv

This website has an interesting, and apparently up-to-date, set of statistics: http://w3techs.com/technologies/overview/character_encoding/all

Their current top ten encodings, as of today, are:

UTF-8: 76.7%
ISO-8859-1: 11.7%
Windows-1251 (Cyrillic): 2.9%
GB2312 (Chinese): 2.5%
Shift JIS (Japanese): 1.5%
Windows-1252 (superset of ISO-8859-1): 1.4%
GBK (Chinese): 0.7%
ISO-8859-2 (Eastern Europe, Latin script): 0.4%
EUC-JP (Japanese): 0.4%
Windows-1256 (Arabic): 0.4%

Although the exact interpretation of these results is tricky, since they don't give their criteria for exactly how they define and detect these encodings, if their results are even approximately right, it's pretty clear that UTF-8 now dominates the web as the single commonest charset/encoding by far.

-- N.
Re: Detection of unlabeled UTF-8
On 06/09/13 16:45, Robert Kaiser wrote:
> Hmm, do we have to treat the whole document as a consistent charset? Could we instead, if we don't know the charset, look at every rendered-as-text node/attribute in the DOM tree and run some kind of charset detection on it? May be a dumb idea but might avoid the problem on the parsing level.

I think that would create a whole lot more problems than it would fix, and would be unworkable in practice. Charset detection from content is a probabilistic matter at best; treating the document as many small snippets of text would not only increase the probability of the detection algorithm getting it wrong for each node (shorter inputs give the detector less to go on), but would also give a large number of opportunities per page for at least one of those detections to go wrong.

-- N.
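A quick back-of-the-envelope illustration of Neil's last point, in Python; the 99% per-node accuracy and the 200-node page are invented figures for illustration, not measurements:

```python
# Even a per-snippet detector that is right 99% of the time fails
# somewhere on almost every page once each node is detected separately:
# the chance that all N independent detections are correct is p ** N.
p, nodes = 0.99, 200          # hypothetical accuracy and node count
print(f"{p ** nodes:.1%}")    # ~13.4% of pages would have no garbled node
```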
Re: Detection of unlabeled UTF-8
On 9/6/13 04:25, Henri Sivonen wrote:
> We do surface such UI for https deployment errors inspiring academic papers about how bad it is that users are exposed to such UI.

Sure. It's a much trickier problem (and, in any case, the UI is necessarily more intrusive than what I'm suggesting). There's no good way to explain the nuanced implications of security decisions in a way that is both accessible to a lay user and concise enough to hold the average user's attention.

> On Thu, Sep 5, 2013 at 6:15 PM, Adam Roach wrote:
>> As to the "why," it comes down to balancing the need to let the publisher know that they've done something wrong against punishing the user for the publisher's sins.
> Two problems: 1) The complexity of the platform increases in order to address a fringe case. 2) Making publishers' misdeeds less severe in the short term makes it more OK for publishers to engage in the misdeeds, which in the light of #1 leads to long-term problems. (Consider the character encoding situation in Japan and how HTML parsing in Japanese Firefox is worse than in other locales as the result.)

To the first point: the increase in complexity is fairly minimal for a substantial gain in usability. Absent hard statistics, I suspect we will disagree about how "fringe" this particular exception is. Suffice it to say that I have personally encountered it as a problem as recently as last week. If you think we need to move beyond anecdotes and personal experience, let's go ahead and add telemetry to find out how often this arises in the field.

Your second point is an argument against automatic correction. Don't get me wrong: I think automatic correction leads to innocent publisher mistakes that make things worse over the long term. I absolutely agree that doing so trades short-term gain for long-term damage. But I'm not arguing for automatic correction.

But it's not our job to police the web. It's our job to... and I'm going to borrow some words here... give users "the ability to shape their own experiences on the Internet." You're arguing _against_ that for the purposes of trying to control a group of publishers who, for whatever reason, either lack the ability or don't care enough to fix their content even when their tools clearly tell them that their content is broken.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On 06/09/13 16:17, Adam Roach wrote:
> To the first point: the increase in complexity is fairly minimal for a substantial gain in usability. Absent hard statistics, I suspect we will disagree about how "fringe" this particular exception is. Suffice it to say that I have personally encountered it as a problem as recently as last week. If you think we need to move beyond anecdotes and personal experience, let's go ahead and add telemetry to find out how often this arises in the field.

Data! Sounds like a plan.

Or we could ask our friends at Google or some other search engine to run a version of our detector over their index and see how often it says "UTF-8" when our normal algorithm would say something else.

Gerv
Re: Detection of unlabeled UTF-8
On Thu, Sep 5, 2013 at 7:32 PM, Mike Hoye wrote:
> On 2013-09-05 10:10 AM, Henri Sivonen wrote:
>> It's worth noting that for other classes of authoring errors (except for errors in https deployment) we don't give the user the tools to remedy authoring errors.
> Firefox silently remedies all kinds of authoring errors.

Silently, yes. I was trying to ask about surfacing error-remedying UI to the user. We do surface such UI for https deployment errors, inspiring academic papers about how bad it is that users are exposed to such UI.

On Thu, Sep 5, 2013 at 9:29 PM, Robert Kaiser wrote:
> UTF-8 is what is being suggested everywhere as the encoding to go with, and as we should be able to detect it easily enough, we should do it and switch to it when we find it.

Considering what Aryeh said earlier in this thread, do you have a suggestion how to do that so that:

1) Incremental parsing and rendering aren't hindered. AND
2) The results are deterministic and reliable and don't depend on the byte position of the first non-ASCII byte in the data stream. AND
3) The processing of referenced unlabeled CSS and JavaScript doesn't have race conditions even with speculative parsing involved, is unsurprising and doesn't break legacy content. AND
4) We don't incur the performance penalty of re-parsing or re-building the DOM if authors start labeling UTF-8 less due to no longer having to label. AND
5) Side effects of scripts aren't triggered twice if authors start labeling UTF-8 less due to no longer having to label.

?

On Thu, Sep 5, 2013 at 6:15 PM, Adam Roach wrote:
> As to the "why," it comes down to balancing the need to let the publisher know that they've done something wrong against punishing the user for the publisher's sins.

Two problems:

1) The complexity of the platform increases in order to address a fringe case.
2) Making publishers' misdeeds less severe in the short term makes it more OK for publishers to engage in the misdeeds, which in the light of #1 leads to long-term problems. (Consider the character encoding situation in Japan and how HTML parsing in Japanese Firefox is worse than in other locales as a result.)

-- Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/
Re: Detection of unlabeled UTF-8
On 9/5/13 11:15 AM, Adam Roach wrote:
> I would argue that we do, to some degree, already do this for things like Content-Encoding. For example, if a website attempts to send gzip-encoded bodies without a Content-Encoding header, we don't simply display the compressed body as if it were encoded according to the indicated type

Actually, we do, unless the indicated type is text/plain. The one fixup I'm aware of with Content-Encoding is that if the content type is application/gzip and the Content-Encoding is gzip and the file extension is .gz, we ignore the Content-Encoding. Both of these are workarounds for a very widespread server misconfiguration (in particular, the default Apache configuration for many years had the text/plain problem, and the default Apache configuration on most Linux distributions had the gzip problem).

-Boris
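A sketch of the fixup Boris describes, with invented function and parameter names; the real Gecko logic lives in C++ and checks more conditions than this:

```python
def should_ignore_content_encoding(content_type: str,
                                   content_encoding: str,
                                   filename: str) -> bool:
    # A .gz file served as application/gzip *and* Content-Encoding: gzip
    # is almost always a misconfigured server labeling the compression
    # twice; ignoring the Content-Encoding hands the user the .gz file
    # they asked for instead of silently decompressing it.
    return (content_type == "application/gzip"
            and content_encoding == "gzip"
            and filename.endswith(".gz"))
```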
Re: Detection of unlabeled UTF-8
Zack Weinberg wrote:
> It is possible to distinguish UTF-8 from most legacy encodings heuristically with high reliability, and I'd like to suggest that we ought to do so, independent of locale.

I would very much agree with doing that. UTF-8 is what is being suggested everywhere as the encoding to go with, and as we should be able to detect it easily enough, we should do it and switch to it when we find it.

Robert Kaiser
Re: Detection of unlabeled UTF-8
On 2013-09-05 10:10 AM, Henri Sivonen wrote:
> It's worth noting that for other classes of authoring errors (except for errors in https deployment) we don't give the user the tools to remedy authoring errors.

Firefox silently remedies all kinds of authoring errors.

- mhoye
Re: Detection of unlabeled UTF-8
On 9/5/13 09:10, Henri Sivonen wrote:
> Why should we surface this class of authoring error to the UI in a way that asks the user to make a decision considering how rare this class of authoring error is?

It's not a matter of the user judging the rarity of the condition; it's the user being able to, by casual observation, look at a web page and tell that something is messed up in a way that makes it unusable for them.

> Are there other classes of authoring errors that you think should have UI for the user to second-guess the author? If yes, why? If not, why not?

In theory, yes. In practice, I can't immediately think of any instances that fit the class other than this one and certain Content-Encoding issues. If you want to reduce it to principle, I would say that we should consider it for any authoring error that is (a) relatively common in the wild; (b) trivially detectable by a lay user; (c) trivially detectable by the browser; (d) mechanically reparable by the browser; and (e) has the potential to make a page completely useless.

I would argue that we do, to some degree, already do this for things like Content-Encoding. For example, if a website attempts to send gzip-encoded bodies without a Content-Encoding header, we don't simply display the compressed body as if it were encoded according to the indicated type; we pop up a dialog box to ask the user what to do with the body. I'm proposing nothing more radical than this existing behavior, except in a more user-friendly form.

As to the "why," it comes down to balancing the need to let the publisher know that they've done something wrong against punishing the user for the publisher's sins.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 6:17 PM, Adam Roach wrote:
> It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge.

It's worth noting that for other classes of authoring errors (except for errors in https deployment) we don't give the user the tools to remedy authoring errors.

> Both of these are valid concerns, and I'm afraid that we're not assigning enough weight to the user perspective.

Assigning weight to the *short-term* user perspective seems to be what got us into this mess in the first place. If Netscape had never had a manual override for the character encoding or locale-specific differences, user-exposed brokenness would have quickly taught authors to get their encoding act together--especially in the context of languages like Japanese where a wrong encoding guess makes the page completely unreadable.

(The obvious counter-argument is that in the case of languages that use a non-Latin script, getting the encoding wrong is near the YSoD level of disaster, and it's agreed that XML's error handling was a mistake compared to HTML's. However, HTML's error handling surfaces no UI choices to the user, works without having to reload the page and is now well specified. Furthermore, even in the case of HTML, hindsight says we'd be better off if no browser had tried to be too helpful about fixing errors in the first place.)

> I think we can find some middle ground here, where we help developers discover their misconfiguration, while also handing users the tool they need to fix it. Maybe an unobtrusive bar (similar to the password save bar) that says something like: "This page's character encoding appears to be mislabeled, which might cause certain characters to display incorrectly. Would you like to reload this page as Unicode? [Yes] [No] [More Information] [x]".

Why should we surface this class of authoring error to the UI in a way that asks the user to make a decision considering how rare this class of authoring error is? Are there other classes of authoring errors that you think should have UI for the user to second-guess the author? If yes, why? If not, why not? That is, why is the case where text/html is in fact valid UTF-8 and contains non-ASCII characters but has not been declared as UTF-8 so special compared to other possible authoring errors that it should have special treatment?

On Fri, Aug 30, 2013 at 8:24 PM, Mike Hoye wrote:
> For what it's worth Internet Explorer handled this (before UTF-8 and caring about JS performance were a thing) by guessing what encoding to use, comparing a letter-frequency-analysis of a page's content to a table of which bytes are most common in which encodings of which languages.

Is there evidence of IE doing this in locales other than Japanese, Russian and Ukrainian? Or even locales other than Japanese? Firefox does this only for the Japanese, Russian and Ukrainian locales. (FWIW, studying whether this is still needed for the Russian and Ukrainian locales is https://bugzilla.mozilla.org/show_bug.cgi?id=845791 . As for Japanese, some sort of detection magic is probably staying for the foreseeable future. It appears that Microsoft fairly recently tried to take ISO-2022-JP out of their detector for security reasons but had to put it back for compatibility: http://support.microsoft.com/kb/2416400 http://support.microsoft.com/kb/2482017 )

> It's probably not a suitable approach in modernity, because of performance problems and horrible-though-rare edge cases.

See point #3 in https://bugzilla.mozilla.org/show_bug.cgi?id=910211#c2

On Fri, Aug 30, 2013 at 9:33 PM, Joshua Cranmer 🐧 wrote:
> The problem I have with this approach is that it assumes that the page is authored by someone who definitively knows the charset, which is not a scenario which universally holds. Suppose you have a page that serves up the contents of a plain text file, so your source data has no indication of its charset. What charset should the page report?

Your scenario assumes that the page template is ASCII-only. If it isn't, browser-side guessing doesn't solve the problem. Even when the template is ASCII-only, whoever wrote the inclusion on the server probably has better contextual knowledge about what the encoding of the input text could be than the browser has.

-- Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/
Re: Detection of unlabeled UTF-8
On 9/2/13 13:36, Joshua Cranmer 🐧 wrote:
> I don't think there *is* a sane approach that satisfies everybody. Either you break "UTF8-just-works-everywhere", you break legacy content, you make parsing take inordinate times...

I want to push on this last point a bit.

Using a straightforward UTF-8 detection algorithm (which could probably stand some optimization), it takes my laptop somewhere between 0.9 ms and 1.4 ms to scan a _megabyte_ buffer in order to check whether it consists entirely of valid UTF-8 sequences (the speed variation depends on what proportion of the characters in the buffer are higher than U+007F). That hardly even rises to the level of noise.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
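A minimal sketch of the kind of check Adam is timing, in Python for brevity; his figures presumably come from optimized native code, so absolute numbers will differ:

```python
import timeit

def is_valid_utf8(buf: bytes) -> bool:
    # One linear pass over the buffer. Production validators use a
    # hand-rolled state machine over byte classes rather than
    # try/except, but the asymptotic cost is the same.
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

buf = ("Привет, мир! Hello! " * 40000).encode("utf-8")  # ~1 MB, mixed ASCII/Cyrillic
per_call = timeit.timeit(lambda: is_valid_utf8(buf), number=100) / 100
print(f"{per_call * 1000:.2f} ms per scan")
```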
Re: Detection of unlabeled UTF-8
On 8/30/2013 1:41 PM, Anne van Kesteren wrote:
> On Fri, Aug 30, 2013 at 7:33 PM, Joshua Cranmer 🐧 wrote:
>> The problem I have with this approach is that it assumes that the page is authored by someone who definitively knows the charset, which is not a scenario which universally holds. Suppose you have a page that serves up the contents of a plain text file, so your source data has no indication of its charset. What charset should the page report?
> Where did the text file come from?

The example I have in mind is something like MXR. The text file is some "external" source (say, a file in some source repository).

> There's a source somewhere... And these days that's hardly how people create content anyway.

I would guess that most content these days does not consist of static pages but rather dynamically-generated content that is amalgamated from several databases of various kinds. These sources don't necessarily annotate their text with their charset (indeed, the entire problem we're discussing is due to people not annotating text with its charset). I know of at least one blog where the comments (and only the comments) get mojibake'd (UTF-8 -> ISO-8859-1 -> UTF-8), and I recall in the past seeing an RSS feed that got double-mojibake'd (UTF-8 -> ISO-8859-1 -> UTF-8 -> ISO-8859-1 -> UTF-8). Those examples aren't something the browser can fix, but they should make clear that authors have much less control (and/or knowledge) over the source charsets of their data than you would expect.

> And again, it has already been pointed out we cannot scan the entire byte stream (since text/plain uses the HTML parser it goes for that too, unless we make an exception I suppose, but what data supports that?), which would make the situation worse.

I don't think there *is* a sane approach that satisfies everybody. Either you break "UTF8-just-works-everywhere", you break legacy content, or you make parsing take inordinate time... or you might find a happy medium if you're willing to make document.charset lie. :-)

-- Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
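The round trip Joshua describes is easy to reproduce; a small Python demonstration, decoding with cp1252 since that is how browsers treat content labeled ISO-8859-1:

```python
# Each wrong decode multiplies every non-ASCII character into more bytes.
s = "naïve café"
once = s.encode("utf-8").decode("cp1252")
print(once)   # naÃ¯ve cafÃ©      (mojibake'd once)
twice = once.encode("utf-8").decode("cp1252")
print(twice)  # naÃƒÂ¯ve cafÃƒÂ©  (double-mojibake'd)
```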
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 8:36 PM, Adam Roach wrote:
> On 8/30/13 13:41, Anne van Kesteren wrote:
>> Where did the text file come from? There's a source somewhere... And these days that's hardly how people create content anyway.
> Maybe not for the content _you_ consume, but the Internet is a bit larger than our ivory tower.

I was talking about content creation. As for consumption, I'd love to see data that shows that unlabeled UTF-8 content is common.

-- http://annevankesteren.nl/
Re: Detection of unlabeled UTF-8
Mike Hoye wrote:
> On 2013-08-30 3:17 PM, Adam Roach wrote:
>> On 8/30/13 14:11, Adam Roach wrote:
>>> ...helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ??? ?? "??" ? ?".
>> Well, *there's* a heavy dose of irony in the context of this thread. I wonder what rules our mailing list server applies for character set decimation. When I sent that out, the question marks were a perfectly readable string of Cyrillic characters. Which provides a strong object lesson in the fact that character set configuration is hard. If we can't get this right internally, I think we've lost the moral ground in saying that others should be able to, and tough luck if they can't.
> For what it's worth, the original came through Thunderbird as a perfectly legitimate string of Russian at my end: ??? ?? ? , ??? ?? ??? , ??? ?? ?? ? ??. ??? ?? ???, ??? ?? ??? ?? , ?? ??? ?? ? ??? ? ???.

I just see question marks here, but then again the headers in both messages declare a character set of ISO-8859-1. As for the original message, it seems to have been corrupted; for instance, € characters have been turned into EUR. Maybe it got "converted" from Windows-1252 (which has the € character) into ISO-8859-1 (which does not)?

(I remembered at the last minute to change my character coding to something other than ISO-8859-1, so hopefully those euro signs pass through intact.)

-- Warning: May contain traces of nuts.
Re: Detection of unlabeled UTF-8
On 8/30/13 13:41, Anne van Kesteren wrote:
> Where did the text file come from? There's a source somewhere... And these days that's hardly how people create content anyway.

Maybe not for the content _you_ consume, but the Internet is a bit larger than our ivory tower. Check out, for example: https://www.rfc-editor.org/rse/wiki/lib/exe/fetch.php?media=design:future-unpag-20130820.txt

In particular, when you look at that document, tell me what you think the parenthetical phrase after the author's name is supposed to look like -- because I can guarantee that Firefox isn't doing the right thing here.

> And again, it has already been pointed out we cannot scan the entire byte stream

Sure we can. We just can't fix things on the fly: we'd need something akin to a user prompt and probably a page reload. Which is what I'm proposing.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On 2013-08-30 3:17 PM, Adam Roach wrote:
> On 8/30/13 14:11, Adam Roach wrote:
>> ...helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ??? ?? "??" ? ?".
> Well, *there's* a heavy dose of irony in the context of this thread. I wonder what rules our mailing list server applies for character set decimation. When I sent that out, the question marks were a perfectly readable string of Cyrillic characters. Which provides a strong object lesson in the fact that character set configuration is hard. If we can't get this right internally, I think we've lost the moral ground in saying that others should be able to, and tough luck if they can't.

For what it's worth, the original came through Thunderbird as a perfectly legitimate string of Russian at my end:

??? ?? ? , ??? ?? ??? , ??? ?? ?? ? ??. ??? ?? ???, ??? ?? ??? ?? , ?? ??? ?? ? ??? ? ???.

- mhoye
Re: Detection of unlabeled UTF-8
On 8/30/13 12:24, Mike Hoye wrote:
> On 2013-08-30 11:17 AM, Adam Roach wrote:
>> It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge.
> For what it's worth Internet Explorer handled this (before UTF-8 and caring about JS performance were a thing) by guessing what encoding to use, comparing a letter-frequency-analysis of a page's content to a table of which bytes are most common in which encodings of which languages. ... From both the developer and user perspectives, it amounted to "something went wrong because of bad magic."

I'd like to clarify two points about what I'm proposing.

First, I'm not proposing that we do anything without explicit user intervention, other than present an unobtrusive bar helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ??? ?? "??" ? ?". (No political statement intended here -- that's just the leading headline on Pravda at the moment.) If the user is happy with the encoding, they do nothing and go about their business. If the user determines that the rendering is, in fact, not what they want, they can simply click on the "Yes" button and (with high probability) everything is right with the world again.

Also note that I'm not proposing that we try to do generic character set and language detection. That's fraught with the perils you cite. The topic we're discussing here is UTF-8, which can be easily detected with extremely high confidence.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On 8/30/13 14:11, Adam Roach wrote:
> ...helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ??? ?? "??" ? ?".

Well, *there's* a heavy dose of irony in the context of this thread. I wonder what rules our mailing list server applies for character set decimation. When I sent that out, the question marks were a perfectly readable string of Cyrillic characters.

Which provides a strong object lesson in the fact that character set configuration is hard. If we can't get this right internally, I think we've lost the moral ground in saying that others should be able to, and tough luck if they can't.

/a
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 7:33 PM, Joshua Cranmer 🐧 wrote:
> The problem I have with this approach is that it assumes that the page is authored by someone who definitively knows the charset, which is not a scenario which universally holds. Suppose you have a page that serves up the contents of a plain text file, so your source data has no indication of its charset. What charset should the page report? The choice is between guessing (presumably UTF-8) or saying nothing (which causes the browser to guess Windows-1252, generally).

Where did the text file come from? There's a source somewhere... And these days that's hardly how people create content anyway.

And again, it has already been pointed out we cannot scan the entire byte stream (since text/plain uses the HTML parser it goes for that too, unless we make an exception I suppose, but what data supports that?), which would make the situation worse.

-- http://annevankesteren.nl/
Re: Detection of unlabeled UTF-8
On 8/30/2013 4:01 AM, Anne van Kesteren wrote:
> On Fri, Aug 30, 2013 at 9:40 AM, Gervase Markham wrote:
>> We don't want people to try and move to UTF-8, but move back because they haven't figured out how (or are technically unable) to label it correctly and "it comes out all wrong".
> You also don't want it to be wrong half of the time. Given that full content scans won't fly (we try to restrict scanning for encodings as much as possible), that's a very real possibility, especially given forums such as the one in the OP that are mostly ASCII. Labeling is what people ought to do, and it's very easy: <meta charset=utf-8> (if all other files end up unlabeled, they'll inherit from this one).

The problem I have with this approach is that it assumes that the page is authored by someone who definitively knows the charset, which is not a scenario which universally holds. Suppose you have a page that serves up the contents of a plain text file, so your source data has no indication of its charset. What charset should the page report? The choice is between guessing (presumably UTF-8) or saying nothing (which causes the browser to guess Windows-1252, generally).

-- Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
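One way a server in Joshua's scenario could hedge its guess, sketched in Python with an invented function name; validity as UTF-8 is a strong positive signal, while failure to validate says only "some legacy encoding":

```python
def charset_for(data: bytes) -> str | None:
    # If the bytes happen to be valid UTF-8 (which includes pure ASCII),
    # labeling them UTF-8 is very unlikely to be wrong. Otherwise say
    # nothing and let the browser apply its usual fallback (generally
    # Windows-1252 in Western locales).
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```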
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 6:31 PM, Chris Peterson wrote:
> Is there a less error-prone default we can recommend to Linux distribution packagers? Maybe we can squelch the problem upstream instead of adding browser hacks. The number of web server and distro packagers we would need to reach out to is probably pretty small.

The least error-prone default is not having one at all: that way HTTP does not override the content, and labeling within the content works. Understanding of HTTP is severely limited, so HTTP being authoritative is kind of a problem here.

-- http://annevankesteren.nl/
Re: Detection of unlabeled UTF-8
On 8/30/13 3:03 AM, Henri Sivonen wrote:
> Telemetry data suggests that these days the more common reason for seeing mojibake is that there is an encoding declaration but it is wrong. My guess is that this arises from Linux distributions silently changing their Apache defaults to send a charset parameter in Content-Type on the theory that it's good for security to send one even if the person packaging Apache logically can have no clue of what the value of the parameter should be for a specific deployment. (I think we should not start second-guessing encoding declarations.)

Is there a less error-prone default we can recommend to Linux distribution packagers? Maybe we can squelch the problem upstream instead of adding browser hacks. The number of web server and distro packagers we would need to reach out to is probably pretty small.

chris
Re: Detection of unlabeled UTF-8
On 2013-08-30 11:17 AM, Adam Roach wrote:
> It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge.

For what it's worth, Internet Explorer handled this (before UTF-8 and caring about JS performance were a thing) by guessing what encoding to use, comparing a letter-frequency analysis of a page's content to a table of which bytes are most common in which encodings of which languages.

It's probably not a suitable approach in modernity, because of performance problems and horrible-though-rare edge cases. If whatever you'd written turned out to have an unusual letter frequency, or (worse) a comment added to your badly-written CMS tripped that switch, your previously-Korean page would suddenly and magically start rendering in Hebrew or something, and unless you knew something about character encoding in IE it was basically impossible to figure out what had gone wrong or why. From both the developer and user perspectives, it amounted to "something went wrong because of bad magic."

- mhoye
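A toy sketch of the kind of frequency scoring Mike describes; the table contents, scoring function, and names are invented for illustration (IE's actual tables and math were never public):

```python
from collections import Counter

# Hypothetical per-(encoding, language) tables mapping high bytes to an
# expected relative frequency in text of that language. Note the same
# byte (0xEE) appears in both tables -- overlap like this is exactly how
# a page with unusual letter frequencies could flip languages.
FREQ_TABLES: dict[str, dict[int, float]] = {
    "windows-1251/Russian": {0xEE: 0.11, 0xE0: 0.08, 0xE5: 0.08},  # о, а, е
    "windows-1255/Hebrew":  {0xE9: 0.11, 0xE5: 0.10, 0xEE: 0.08},  # י, ו, מ
}

def best_guess(data: bytes) -> str:
    high = Counter(b for b in data if b >= 0x80)
    total = sum(high.values()) or 1

    def score(table: dict[int, float]) -> float:
        # Overlap between observed and expected high-byte frequencies.
        return sum(min(high[b] / total, f) for b, f in table.items())

    return max(FREQ_TABLES, key=lambda name: score(FREQ_TABLES[name]))
```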
Re: Detection of unlabeled UTF-8
On 8/30/13 05:08, Nicholas Nethercote wrote:
> On Fri, Aug 30, 2013 at 8:03 PM, Henri Sivonen wrote:
>> I think we should encourage Web authors to use UTF-8 *and* to *declare* it.
> I'm no expert on this stuff, but Henri's point sure sounds sensible to me.

It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge. Both of these are valid concerns, and I'm afraid that we're not assigning enough weight to the user perspective.

I think we can find some middle ground here, where we help developers discover their misconfiguration, while also handing users the tool they need to fix it. Maybe an unobtrusive bar (similar to the password save bar) that says something like: "This page's character encoding appears to be mislabeled, which might cause certain characters to display incorrectly. Would you like to reload this page as Unicode? [Yes] [No] [More Information] [x]".

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 4:31 PM, Aryeh Gregor wrote:
> In particular, you need to decide on the encoding before you start running any user script, because you don't want document.characterSet etc. to change once it might have already been accessed. For performance reasons, we want to be able to run scripts immediately after receiving the initial TCP response, if there are any to run yet. This implies we need to decide on the character set after reading the first segment, which typically will not contain the actual content of the page that we would want to sniff on pages like http://www.eyrie-productions.com/. Right?

Right.

> (I say this only because my initial reaction was that we could hold off on deciding what encoding to use until we find the first non-ASCII byte without any ill effects, if we really wanted to. That would probably make the site in question work. But then I realized it would break document.characterSet, so it's not an option even if we wanted more sniffing.)

Right. The idea occurred to me, too, and then I thought of scripts and styles.

-- Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 1:03 PM, Henri Sivonen wrote:
> This is true if you run the heuristic over the entire byte stream. Unfortunately, since we support incremental loading of HTML (and will have to continue to do so), we don't have the entire byte stream available at the time when we need to make a decision of what encoding to assume.

In particular, you need to decide on the encoding before you start running any user script, because you don't want document.characterSet etc. to change once it might have already been accessed. For performance reasons, we want to be able to run scripts immediately after receiving the initial TCP response, if there are any to run yet. This implies we need to decide on the character set after reading the first segment, which typically will not contain the actual content of the page that we would want to sniff on pages like http://www.eyrie-productions.com/. Right?

(I say this only because my initial reaction was that we could hold off on deciding what encoding to use until we find the first non-ASCII byte without any ill effects, if we really wanted to. That would probably make the site in question work. But then I realized it would break document.characterSet, so it's not an option even if we wanted more sniffing.)
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 8:03 PM, Henri Sivonen wrote:
> I think we should encourage Web authors to use UTF-8 *and* to *declare* it.

I'm no expert on this stuff, but Henri's point sure sounds sensible to me.

Nick
Re: Detection of unlabeled UTF-8
On Thu, Aug 29, 2013 at 9:41 PM, Zack Weinberg wrote:
> All the discussion of fallback character encodings has reminded me of an issue I've been meaning to bring up for some time: As a user of the en-US localization, nowadays the overwhelmingly most common situation where I see mojibake is when a site puts UTF-8 in its pages without declaring any encoding at all (neither via <meta> nor Content-Type).

Telemetry data suggests that these days the more common reason for seeing mojibake is that there is an encoding declaration but it is wrong. My guess is that this arises from Linux distributions silently changing their Apache defaults to send a charset parameter in Content-Type on the theory that it's good for security to send one even if the person packaging Apache logically can have no clue of what the value of the parameter should be for a specific deployment. (I think we should not start second-guessing encoding declarations.)

> It is possible to distinguish UTF-8 from most legacy encodings heuristically with high reliability, and I'd like to suggest that we ought to do so, independent of locale.

This is true if you run the heuristic over the entire byte stream. Unfortunately, since we support incremental loading of HTML (and will have to continue to do so), we don't have the entire byte stream available at the time when we need to make a decision of what encoding to assume.

> Having read through a bunch of the "fallback encoding is wrong" bugs Henri's been filing, I have the impression that Henri would prefer we *not* detect UTF-8

Correct. Every time a localization sets the fallback to UTF-8 or a heuristic detector detects unlabeled UTF-8 is an opportunity for Web authors to generate a new legacy of unlabeled UTF-8 content thinking that everything is okay.

> 1. There exist sites that still regularly add new, UTF-8-encoded content, but whose *structure* was laid down in the late 1990s or early 2000s, declares no encoding, and is unlikely ever to be updated again. The example I have to hand is http://www.eyrie-productions.com/Forum/dcboard.cgi?az=read_count&om=138&forum=DCForumID24&viewmode=threaded ; many other posts on this forum have the same problem. Take note of the vintage HTML. I suggested to the admins of this site that they add <meta charset="utf-8"> to the master page template, and was told that no one involved in current day-to-day operations has the necessary access privileges. I suspect that this kind of situation is rather more common than we would like to believe.

It's easy to have an anecdotal single data point of something on the Web being broken. Is there any data on how common this problem is relative to other legacy encoding phenomena?

> 2. For some of the fallback-encoding-is-wrong bugs still open, a binary UTF-8/unibyte heuristic would save the localization from having to choose between displaying legacy minority-language content correctly and displaying legacy hegemonic-language content correctly. If I understand correctly, this is the case at least for Welsh: https://bugzilla.mozilla.org/show_bug.cgi?id=844087 .

If we hadn't been defaulting to UTF-8 in any localization at any point, the minority-language unlabeled UTF-8 legacy would not have had a chance to develop. It's terrible that after having made the initial mistake of letting an unlabeled non-UTF-8 legacy develop, the mistake has been repeated for some localizations by allowing a legacy of unlabeled UTF-8 to develop. We might still have a chance of stopping the new legacy of unlabeled UTF-8 from developing.

> 3. Files loaded from local disk have no encoding metadata from the transport, and may have no in-band label either; in particular, UTF-8 plain text with no byte order mark, which is increasingly common, should not be misidentified as the legacy encoding.

When accessing the local disk, it might indeed make sense to examine all the bytes of the file before starting parsing.

> Having a binary UTF-8/unibyte heuristic might address some of the concerns mentioned in the "File API should not use 'universal' character detection" bug, https://bugzilla.mozilla.org/show_bug.cgi?id=848842 .

I think in the case of the File API, we should just implement what the spec says and assume UTF-8. I think it's reprehensible that we have pulled non-spec magic out of thin air here.

> If people are concerned about "infecting" the modern platform with heuristics, perhaps we could limit application of the heuristic to quirks mode, for HTML delivered over HTTP.

I'm not particularly happy about the prospect of having to change the order of the quirkiness determination and the encoding determination.

On Fri, Aug 30, 2013 at 11:40 AM, Gervase Markham wrote:
> That seems wise to me, on gut instinct.

It looks to me that it was gut instinct that led to stuff like the Esperanto locale setting the fallback to UTF-8, thereby making the locale top the list of character encoding overwr
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 9:40 AM, Gervase Markham wrote:
> We don't want people to try and move to UTF-8, but move back because they haven't figured out how (or are technically unable) to label it correctly and "it comes out all wrong".

You also don't want it to be wrong half of the time. Given that full content scans won't fly (we try to restrict scanning for encodings as much as possible), that's a very real possibility, especially given forums such as the one in the OP that are mostly ASCII. Labeling is what people ought to do, and it's very easy: <meta charset=utf-8> (if all other files end up unlabeled, they'll inherit from this one).

-- http://annevankesteren.nl/
Re: Detection of unlabeled UTF-8
On 29/08/13 19:41, Zack Weinberg wrote:
> All the discussion of fallback character encodings has reminded me of an issue I've been meaning to bring up for some time: As a user of the en-US localization, nowadays the overwhelmingly most common situation where I see mojibake is when a site puts UTF-8 in its pages without declaring any encoding at all (neither via <meta> nor Content-Type). It is possible to distinguish UTF-8 from most legacy encodings heuristically with high reliability, and I'd like to suggest that we ought to do so, independent of locale.

That seems wise to me, on gut instinct. If the web is moving to UTF-8, and we are trying to encourage that, then it seems we should expect that this is what we get unless there are hints that we are wrong, whether that's the TLD, the statistical profile of the characters, or something else.

We don't want people to try and move to UTF-8, but move back because they haven't figured out how (or are technically unable) to label it correctly and "it comes out all wrong".

Gerv
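The core of the heuristic Zack describes reduces to a small decision procedure; a Python sketch, with the three-way split being one editor's framing. The structural check below skips the finer second-byte restrictions (overlong forms, surrogates) that a production validator per the Encoding Standard enforces:

```python
def classify(data: bytes) -> str:
    if all(b < 0x80 for b in data):
        return "ascii"   # renders identically under any ASCII-compatible encoding
    # Structural UTF-8 check: every lead byte must be followed by exactly
    # the right number of 0b10xxxxxx continuation bytes. Legacy 8-bit text
    # breaks this rule almost immediately, which is why false positives
    # are so rare.
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            need = 0            # ASCII byte
        elif 0xC2 <= b <= 0xDF:
            need = 1            # two-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2            # three-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3            # four-byte sequence
        else:
            return "legacy"     # stray continuation or illegal lead byte
        for j in range(1, need + 1):
            if i + j >= n or not 0x80 <= data[i + j] <= 0xBF:
                return "legacy"
        i += need + 1
    return "utf-8"

print(classify("déjà vu".encode("utf-8")))   # utf-8
print(classify("déjà vu".encode("cp1252")))  # legacy: 0xE9 followed by ASCII
```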
Re: Detection of unlabeled UTF-8
On Thu, Aug 29, 2013 at 7:41 PM, Zack Weinberg wrote:
> If people are concerned about "infecting" the modern platform with heuristics, perhaps we could limit application of the heuristic to quirks mode, for HTML delivered over HTTP. I expect this would cover the majority of the sites described under point 1, and probably 2 as well.

We should not introduce new heuristics. We could maybe introduce a new algorithm to the platform, but only if there's buy-in across the board. Given how fast UTF-8 adoption is rising, though, I'm not sure it's worth the effort for the couple of sites that might be helped.

-- http://annevankesteren.nl/