Re: Detection of unlabeled UTF-8
Adam Roach wrote:
> when you look at that document, tell me what you think the parenthetical phrase after the author's name is supposed to look like -- because I can guarantee that Firefox isn't doing the right thing here.

In my case it does, and displays: Хизер Фланаган. I have the universal charset detector activated.

The simple thing Firefox could do is to interpret "text/plain" and "text/plain;charset=us-ascii" as UTF-8 by default. In the first case, handling of non-ASCII characters is undefined, and in the second they are illegal. So in these two cases, UTF-8 cannot break anything that is supposed to work.
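A minimal sketch of the fallback the poster is describing, in Python; the function name and the dispatch logic are invented for illustration, not anything Firefox does. The underlying fact is that US-ASCII is a strict subset of UTF-8, so the promotion is lossless for content that actually conforms to its label:

```python
def effective_charset(content_type: str) -> str | None:
    # Hypothetical fallback: every valid US-ASCII byte stream is also
    # valid UTF-8, so promoting these two labels to UTF-8 cannot change
    # the rendering of any page that conforms to its declared label.
    ct = content_type.lower().replace(" ", "")
    if ct in ("text/plain", "text/plain;charset=us-ascii"):
        return "utf-8"
    if "charset=" in ct:
        return ct.split("charset=", 1)[1]  # honor an explicit label
    return None  # fall through to the usual fallback-encoding path
```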
Re: Detection of unlabeled UTF-8
And then you get sites that send ISO-8859-1 but the server is configured to send UTF-8 in the headers, e.g. http://darwinawards.com/darwin/darwin1999-38.html

-- Warning: May contain traces of nuts.
Re: Detection of unlabeled UTF-8
On 9/9/13 02:31, Henri Sivonen wrote:
> We don't have telemetry for the question "How often are pages that are not labeled as UTF-8, UTF-16 or anything that maps to their replacement encoding according to the Encoding Standard and that contain non-ASCII bytes in fact valid UTF-8?" How rare would the mislabeled UTF-8 case need to be for you to consider the UI that you're proposing not worth it?

I'd think it would depend somewhat on the severity of the misencoding. For example, interpreting a page of UTF-8 as Windows-1252 isn't generally going to completely ruin a page with the occasional accented Latin character, although it will certainly be an obvious defect. I'd be happy to leave the situation be if this happened to fewer than 1% of users over a six-week period.

On the other hand, misrendering a page of UTF-8 that consists predominantly of a non-Latin script is pretty catastrophic, and is going to tend to happen to the same subset of users over and over again. For that situation, I think I'd like to see fewer than 0.1% of users of builds localized for languages written in non-Latin scripts impacted over a six-week period before I was happy leaving things as-is.

> However, we do have telemetry for the percentage of Firefox sessions in which the current character encoding override UI has been used at least once. See https://bugzilla.mozilla.org/show_bug.cgi?id=906032 for the results broken down by desktop versus Android and then by locale.

I don't think measuring the behavior of those few people who know about this feature is particularly relevant. The status quo works for them, by definition. I'm far more concerned about those users who get garbled pages and don't have the knowledge to do anything about it.

> I would accept a (performance-conscious) patch for gathering telemetry for the UTF-8 question in the HTML parser. However, I'm not volunteering to write one myself immediately, because I have bugs on my todo list that have been caused by previous attempts of Gecko developers to be well-intentioned about DWIM and UI around character encodings. Gotta fix those first.

Great. I'll see if I can wedge in some time to put one together (although I'm similarly swamped, so I don't have a good timeframe for this). If anyone else has time to roll one out, that would be even better.

> Even non-automatic correction means authors can take the attitude that getting the encoding wrong is no big deal since the fix is a click away for the user.

I'll repeat that it's not our job to police the web. I'm firmly of the opinion that those developers who don't care about doing things right won't do them right no matter how big a stick you personally choose to beat them with. On the other hand, I'm quite worried about collateral damage to our users in your crusade to control publishers.

Give the publishers the tools to understand their errors, and the users the tools to use the web the way they want to use it. Those publishers who aren't bad actors will correct their own behavior -- those who _are_ bad actors aren't going to behave anyway. There's no point getting authoritarian about it and making the web a less accessible place as a consequence.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On Fri, Sep 6, 2013 at 6:17 PM, Adam Roach wrote:
> Sure. It's a much trickier problem (and, in any case, the UI is necessarily more intrusive than what I'm suggesting). There's no good way to explain the nuanced implications of security decisions in a way that is both accessible to a lay user and concise enough to hold the average user's attention.

Yes, the decisions that the user is asked to make in the case of HTTPS deployment errors are more difficult than the decision whether to reload the page as UTF-8.

(Just for completeness, I should mention that what you're proposing could be security-sensitive without some further tweaks. For starters, if a page has been labeled as UTF-16 or anything that maps to the replacement encoding according to the Encoding Standard, we should not let the user reload the page as UTF-8. When I say "labeled as UTF-16", I mean labels that are supposed to take effect as UTF-16 per WHATWG HTML. I don't mean the sort of bogus UTF-16 labels that actually are treated as UTF-8 labels by WHATWG HTML.)

> To the first point: the increase in complexity is fairly minimal for a substantial gain in usability.

How substantial the gain in usability would be is not known without exact telemetry, but see below. As for complexity, as the person who has been working with the relevant code the most in the last couple of years, I think we should try to get rid of the code for implementing encoding overrides by the user instead of coming up with new ways to trigger that code. Thanks to e.g. the mistake of introducing UTF-16 as an interchange encoding to the Web, that code has needed security fixes.

> Absent hard statistics, I suspect we will disagree about how "fringe" this particular exception is. Suffice it to say that I have personally encountered it as a problem as recently as last week. If you think we need to move beyond anecdotes and personal experience, let's go ahead and add telemetry to find out how often this arises in the field.

We don't have telemetry for the question "How often are pages that are not labeled as UTF-8, UTF-16 or anything that maps to their replacement encoding according to the Encoding Standard and that contain non-ASCII bytes in fact valid UTF-8?" How rare would the mislabeled UTF-8 case need to be for you to consider the UI that you're proposing not worth it?

However, we do have telemetry for the percentage of Firefox sessions in which the current character encoding override UI has been used at least once. See https://bugzilla.mozilla.org/show_bug.cgi?id=906032 for the results broken down by desktop versus Android and then by locale.

One could speculate about the answer to the UTF-8 question relative to this telemetry data both ways: since the general character encoding override usage includes cases where the encoding being switched to is not UTF-8, one could expect the UTF-8 case to be even more fringe than what these telemetry results show. On the other hand, these telemetry results show only cases where the user is aware of the existence of the character encoding override UI and bothers to use it, so one could argue that the UTF-8 case could actually be more common.

I would accept a (performance-conscious) patch for gathering telemetry for the UTF-8 question in the HTML parser. However, I'm not volunteering to write one myself immediately, because I have bugs on my todo list that have been caused by previous attempts of Gecko developers to be well-intentioned about DWIM and UI around character encodings. Gotta fix those first.

> Your second point is an argument against automatic correction. Don't get me wrong: I think automatic correction leads to innocent publisher mistakes that make things worse over the long term. I absolutely agree that doing so trades short-term gain for long-term damage. But I'm not arguing for automatic correction.

Even non-automatic correction means authors can take the attitude that getting the encoding wrong is no big deal since the fix is a click away for the user. But how will that UI work in non-browser apps that load Web content on B2G, etc.?

On Fri, Sep 6, 2013 at 6:45 PM, Robert Kaiser wrote:
> Hmm, do we have to treat the whole document as a consistent charset?

The practical answer is yes.

> Could we instead, if we don't know the charset, look at every rendered-as-text node/attribute in the DOM tree and run some kind of charset detection on it? May be a dumb idea but might avoid the problem on the parsing level.

And then we'd have at least 34 problems (if my quick count of legacy encodings was correct). On a more serious note, though, it's a bad idea to develop complex solutions to problems that are actually relatively rare on the Web these days, and it's even worse to go deeper into DWIM when experience shows that DWIM in this area is a big part of the reason we have this mess.

On Fri, Sep 6, 2013 at 7:36 PM, Neil Harris wrote:
> http://w3techs.com/techn
Re: Detection of unlabeled UTF-8
On 06/09/13 18:28, Boris Zbarsky wrote:
> On 9/6/13 1:11 PM, Neil Harris wrote:
>> Presumably most of that XHTML is being generated by automated tools
> Presumably most of that "XHTML" are tag-soup pages which claim to have an XHTML doctype. The chance of them actually being valid XHTML is slim to none (though maybe higher than the chance of them being served as application/xhtml+xml).
> -Boris

Indeed. I suspect in quite a lot of these cases the reason for using an XHTML doctype was that XHTML's got an "X", and anything with an "X" added has _got_ to be better.

-- XNeil
Re: Detection of unlabeled UTF-8
On Friday, September 6, 2013 at 5:36 PM, Neil Harris wrote:
> On 06/09/13 16:34, Gervase Markham wrote:
>> Data! Sounds like a plan.
>> Or we could ask our friends at Google or some other search engine to run a version of our detector over their index and see how often it says "UTF-8" when our normal algorithm would say something else.
>> Gerv
> This website has an interesting, and apparently up-to-date set of statistics:

Wait a minute, they also claim that XHTML is used on 54.9% of sites? I'm skeptical of their methodology. See: http://w3techs.com/technologies/overview/markup_language/all
Re: Detection of unlabeled UTF-8
On 06/09/13 17:48, Marcos Caceres wrote:
> Wait a minute, they also claim that XHTML is used on 54.9% of sites? I'm skeptical of their methodology. See: http://w3techs.com/technologies/overview/markup_language/all

That surprised me on reading it, too. However, it doesn't seem too far from this site's estimate for the same thing: http://try.powermapper.com/Stats/HtmlVersions

Presumably most of that XHTML is being generated by automated tools whose authors assumed that XHTML represented the "latest and greatest" HTML spec, and who now seem to be in the process of transitioning to the new latest-and-greatest, HTML 5, as shown by the rising tide of HTML 5 in the above graph.

-- Neil
Re: Detection of unlabeled UTF-8
Henri Sivonen wrote:
> Considering what Aryeh said earlier in this thread, do you have a suggestion how to do that so that [...]

Hmm, do we have to treat the whole document as a consistent charset? Could we instead, if we don't know the charset, look at every rendered-as-text node/attribute in the DOM tree and run some kind of charset detection on it?

May be a dumb idea but might avoid the problem on the parsing level.

Robert Kaiser
Re: Detection of unlabeled UTF-8
On 9/6/13 1:11 PM, Neil Harris wrote:
> Presumably most of that XHTML is being generated by automated tools

Presumably most of that "XHTML" are tag-soup pages which claim to have an XHTML doctype. The chance of them actually being valid XHTML is slim to none (though maybe higher than the chance of them being served as application/xhtml+xml).

-Boris
Re: Detection of unlabeled UTF-8
On 06/09/13 16:34, Gervase Markham wrote:
> Data! Sounds like a plan.
> Or we could ask our friends at Google or some other search engine to run a version of our detector over their index and see how often it says "UTF-8" when our normal algorithm would say something else.
> Gerv

This website has an interesting, and apparently up-to-date, set of statistics: http://w3techs.com/technologies/overview/character_encoding/all

Their current top ten encodings, as of today, are:

UTF-8: 76.7%
ISO-8859-1: 11.7%
Windows-1251 (Cyrillic): 2.9%
GB2312 (Chinese): 2.5%
Shift JIS (Japanese): 1.5%
Windows-1252 (superset of ISO-8859-1): 1.4%
GBK (Chinese): 0.7%
ISO-8859-2 (Eastern Europe, Latin script): 0.4%
EUC-JP (Japanese): 0.4%
Windows-1256 (Arabic): 0.4%

Although the exact interpretation of these results is tricky, since they don't give their criteria for exactly how they define and detect these encodings, if their results are even approximately right, it's pretty clear that UTF-8 now dominates the web as the single commonest charset/encoding by far.

-- N.
Re: Detection of unlabeled UTF-8
On 06/09/13 16:45, Robert Kaiser wrote:
> Hmm, do we have to treat the whole document as a consistent charset? Could we instead, if we don't know the charset, look at every rendered-as-text node/attribute in the DOM tree and run some kind of charset detection on it? May be a dumb idea but might avoid the problem on the parsing level.

I think that would create a whole lot more problems than it would fix, and would be unworkable in practice. Charset detection from content is a probabilistic matter at best; treating the document as many small snippets of text would not only increase the probability of the detection algorithm getting it wrong for each node (shorter inputs give the detector less to go on), but would also give a large number of opportunities per page for at least one of those detections to go wrong.

-- N.
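A quick back-of-the-envelope illustration of Neil's last point, in Python; the 99% per-node accuracy and the 200-node page are invented figures for illustration, not measurements:

```python
# Even a per-snippet detector that is right 99% of the time fails
# somewhere on almost every page once each node is detected separately:
# the chance that all N independent detections are correct is p ** N.
p, nodes = 0.99, 200          # hypothetical accuracy and node count
print(f"{p ** nodes:.1%}")    # ~13.4% of pages would have no garbled node
```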
Re: Detection of unlabeled UTF-8
On 9/6/13 04:25, Henri Sivonen wrote:
> We do surface such UI for https deployment errors inspiring academic papers about how bad it is that users are exposed to such UI.

Sure. It's a much trickier problem (and, in any case, the UI is necessarily more intrusive than what I'm suggesting). There's no good way to explain the nuanced implications of security decisions in a way that is both accessible to a lay user and concise enough to hold the average user's attention.

> On Thu, Sep 5, 2013 at 6:15 PM, Adam Roach wrote:
>> As to the "why," it comes down to balancing the need to let the publisher know that they've done something wrong against punishing the user for the publisher's sins.
> Two problems: 1) The complexity of the platform increases in order to address a fringe case. 2) Making publishers' misdeeds less severe in the short term makes it more OK for publishers to engage in the misdeeds, which in the light of #1 leads to long-term problems. (Consider the character encoding situation in Japan and how HTML parsing in Japanese Firefox is worse than in other locales as the result.)

To the first point: the increase in complexity is fairly minimal for a substantial gain in usability. Absent hard statistics, I suspect we will disagree about how "fringe" this particular exception is. Suffice it to say that I have personally encountered it as a problem as recently as last week. If you think we need to move beyond anecdotes and personal experience, let's go ahead and add telemetry to find out how often this arises in the field.

Your second point is an argument against automatic correction. Don't get me wrong: I think automatic correction leads to innocent publisher mistakes that make things worse over the long term. I absolutely agree that doing so trades short-term gain for long-term damage. But I'm not arguing for automatic correction.

But it's not our job to police the web. It's our job to... and I'm going to borrow some words here... give users "the ability to shape their own experiences on the Internet." You're arguing _against_ that for the purposes of trying to control a group of publishers who, for whatever reason, either lack the ability or don't care enough to fix their content even when their tools clearly tell them that their content is broken.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On 06/09/13 16:17, Adam Roach wrote:
> To the first point: the increase in complexity is fairly minimal for a substantial gain in usability. Absent hard statistics, I suspect we will disagree about how "fringe" this particular exception is. Suffice it to say that I have personally encountered it as a problem as recently as last week. If you think we need to move beyond anecdotes and personal experience, let's go ahead and add telemetry to find out how often this arises in the field.

Data! Sounds like a plan.

Or we could ask our friends at Google or some other search engine to run a version of our detector over their index and see how often it says "UTF-8" when our normal algorithm would say something else.

Gerv
Re: Detection of unlabeled UTF-8
On Thu, Sep 5, 2013 at 7:32 PM, Mike Hoye wrote:
> On 2013-09-05 10:10 AM, Henri Sivonen wrote:
>> It's worth noting that for other classes of authoring errors (except for errors in https deployment) we don't give the user the tools to remedy authoring errors.
> Firefox silently remedies all kinds of authoring errors.

Silently, yes. I was trying to ask about surfacing error-remedying UI to the user. We do surface such UI for https deployment errors, inspiring academic papers about how bad it is that users are exposed to such UI.

On Thu, Sep 5, 2013 at 9:29 PM, Robert Kaiser wrote:
> UTF-8 is what is being suggested everywhere as the encoding to go with, and as we should be able to detect it easily enough, we should do it and switch to it when we find it.

Considering what Aryeh said earlier in this thread, do you have a suggestion how to do that so that:

1) Incremental parsing and rendering aren't hindered. AND
2) The results are deterministic and reliable and don't depend on the byte position of the first non-ASCII byte in the data stream. AND
3) The processing of referenced unlabeled CSS and JavaScript doesn't have race conditions even with speculative parsing involved, is unsurprising and doesn't break legacy content. AND
4) We don't incur the performance penalty of re-parsing or re-building the DOM if authors start labeling UTF-8 less due to no longer having to label. AND
5) Side effects of scripts aren't triggered twice if authors start labeling UTF-8 less due to no longer having to label.

?

On Thu, Sep 5, 2013 at 6:15 PM, Adam Roach wrote:
> As to the "why," it comes down to balancing the need to let the publisher know that they've done something wrong against punishing the user for the publisher's sins.

Two problems:

1) The complexity of the platform increases in order to address a fringe case.
2) Making publishers' misdeeds less severe in the short term makes it more OK for publishers to engage in the misdeeds, which in the light of #1 leads to long-term problems. (Consider the character encoding situation in Japan and how HTML parsing in Japanese Firefox is worse than in other locales as a result.)

-- Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/
Re: Detection of unlabeled UTF-8
On 9/5/13 11:15 AM, Adam Roach wrote:
> I would argue that we do, to some degree, already do this for things like Content-Encoding. For example, if a website attempts to send gzip-encoded bodies without a Content-Encoding header, we don't simply display the compressed body as if it were encoded according to the indicated type

Actually, we do, unless the indicated type is text/plain. The one fixup I'm aware of with Content-Encoding is that if the content type is application/gzip and the Content-Encoding is gzip and the file extension is .gz, we ignore the Content-Encoding. Both of these are workarounds for a very widespread server misconfiguration (in particular, the default Apache configuration for many years had the text/plain problem, and the default Apache configuration on most Linux distributions had the gzip problem).

-Boris
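A sketch of the fixup Boris describes, with invented function and parameter names; the real Gecko logic lives in C++ and checks more conditions than this:

```python
def should_ignore_content_encoding(content_type: str,
                                   content_encoding: str,
                                   filename: str) -> bool:
    # A .gz file served as application/gzip *and* Content-Encoding: gzip
    # is almost always a misconfigured server labeling the compression
    # twice; ignoring the Content-Encoding hands the user the .gz file
    # they asked for instead of silently decompressing it.
    return (content_type == "application/gzip"
            and content_encoding == "gzip"
            and filename.endswith(".gz"))
```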
Re: Detection of unlabeled UTF-8
Zack Weinberg wrote:
> It is possible to distinguish UTF-8 from most legacy encodings heuristically with high reliability, and I'd like to suggest that we ought to do so, independent of locale.

I would very much agree with doing that. UTF-8 is what is being suggested everywhere as the encoding to go with, and as we should be able to detect it easily enough, we should do it and switch to it when we find it.

Robert Kaiser
Re: Detection of unlabeled UTF-8
On 2013-09-05 10:10 AM, Henri Sivonen wrote:
> It's worth noting that for other classes of authoring errors (except for errors in https deployment) we don't give the user the tools to remedy authoring errors.

Firefox silently remedies all kinds of authoring errors.

- mhoye
Re: Detection of unlabeled UTF-8
On 9/5/13 09:10, Henri Sivonen wrote:
> Why should we surface this class of authoring error to the UI in a way that asks the user to make a decision considering how rare this class of authoring error is?

It's not a matter of the user judging the rarity of the condition; it's the user being able to, by casual observation, look at a web page and tell that something is messed up in a way that makes it unusable for them.

> Are there other classes of authoring errors that you think should have UI for the user to second-guess the author? If yes, why? If not, why not?

In theory, yes. In practice, I can't immediately think of any instances that fit the class other than this one and certain Content-Encoding issues. If you want to reduce it to principle, I would say that we should consider it for any authoring error that is (a) relatively common in the wild; (b) trivially detectable by a lay user; (c) trivially detectable by the browser; (d) mechanically reparable by the browser; and (e) has the potential to make a page completely useless.

I would argue that we do, to some degree, already do this for things like Content-Encoding. For example, if a website attempts to send gzip-encoded bodies without a Content-Encoding header, we don't simply display the compressed body as if it were encoded according to the indicated type; we pop up a dialog box to ask the user what to do with the body. I'm proposing nothing more radical than this existing behavior, except in a more user-friendly form.

As to the "why," it comes down to balancing the need to let the publisher know that they've done something wrong against punishing the user for the publisher's sins.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 6:17 PM, Adam Roach wrote:
> It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge.

It's worth noting that for other classes of authoring errors (except for errors in https deployment) we don't give the user the tools to remedy authoring errors.

> Both of these are valid concerns, and I'm afraid that we're not assigning enough weight to the user perspective.

Assigning weight to the *short-term* user perspective seems to be what got us into this mess in the first place. If Netscape had never had a manual override for the character encoding or locale-specific differences, user-exposed brokenness would have quickly taught authors to get their encoding act together--especially in the context of languages like Japanese where a wrong encoding guess makes the page completely unreadable.

(The obvious counter-argument is that in the case of languages that use a non-Latin script, getting the encoding wrong is near the YSoD level of disaster, and it's agreed that XML's error handling was a mistake compared to HTML's. However, HTML's error handling surfaces no UI choices to the user, works without having to reload the page and is now well specified. Furthermore, even in the case of HTML, hindsight says we'd be better off if no browser had tried to be too helpful about fixing errors in the first place.)

> I think we can find some middle ground here, where we help developers discover their misconfiguration, while also handing users the tool they need to fix it. Maybe an unobtrusive bar (similar to the password save bar) that says something like: "This page's character encoding appears to be mislabeled, which might cause certain characters to display incorrectly. Would you like to reload this page as Unicode? [Yes] [No] [More Information] [x]".

Why should we surface this class of authoring error to the UI in a way that asks the user to make a decision considering how rare this class of authoring error is? Are there other classes of authoring errors that you think should have UI for the user to second-guess the author? If yes, why? If not, why not? That is, why is the case where text/html is in fact valid UTF-8 and contains non-ASCII characters but has not been declared as UTF-8 so special compared to other possible authoring errors that it should have special treatment?

On Fri, Aug 30, 2013 at 8:24 PM, Mike Hoye wrote:
> For what it's worth Internet Explorer handled this (before UTF-8 and caring about JS performance were a thing) by guessing what encoding to use, comparing a letter-frequency-analysis of a page's content to a table of which bytes are most common in which encodings of which languages.

Is there evidence of IE doing this in locales other than Japanese, Russian and Ukrainian? Or even locales other than Japanese? Firefox does this only for the Japanese, Russian and Ukrainian locales. (FWIW, studying whether this is still needed for the Russian and Ukrainian locales is https://bugzilla.mozilla.org/show_bug.cgi?id=845791 . As for Japanese, some sort of detection magic is probably staying for the foreseeable future. It appears that Microsoft fairly recently tried to take ISO-2022-JP out of their detector for security reasons but had to put it back for compatibility: http://support.microsoft.com/kb/2416400 http://support.microsoft.com/kb/2482017 )

> It's probably not a suitable approach in modernity, because of performance problems and horrible-though-rare edge cases.

See point #3 in https://bugzilla.mozilla.org/show_bug.cgi?id=910211#c2

On Fri, Aug 30, 2013 at 9:33 PM, Joshua Cranmer 🐧 wrote:
> The problem I have with this approach is that it assumes that the page is authored by someone who definitively knows the charset, which is not a scenario which universally holds. Suppose you have a page that serves up the contents of a plain text file, so your source data has no indication of its charset. What charset should the page report?

Your scenario assumes that the page template is ASCII-only. If it isn't, browser-side guessing doesn't solve the problem. Even when the template is ASCII-only, whoever wrote the inclusion on the server probably has better contextual knowledge about what the encoding of the input text could be than the browser has.

-- Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/
Re: Detection of unlabeled UTF-8
On 9/2/13 13:36, Joshua Cranmer 🐧 wrote:
> I don't think there *is* a sane approach that satisfies everybody. Either you break "UTF8-just-works-everywhere", you break legacy content, you make parsing take inordinate times...

I want to push on this last point a bit.

Using a straightforward UTF-8 detection algorithm (which could probably stand some optimization), it takes my laptop somewhere between 0.9 ms and 1.4 ms to scan a _megabyte_ buffer in order to check whether it consists entirely of valid UTF-8 sequences (the speed variation depends on what proportion of the characters in the buffer are higher than U+007F). That hardly even rises to the level of noise.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
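A minimal sketch of the kind of check Adam is timing, in Python for brevity; his figures presumably come from optimized native code, so absolute numbers will differ:

```python
import timeit

def is_valid_utf8(buf: bytes) -> bool:
    # One linear pass over the buffer. Production validators use a
    # hand-rolled state machine over byte classes rather than
    # try/except, but the asymptotic cost is the same.
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

buf = ("Привет, мир! Hello! " * 40000).encode("utf-8")  # ~1 MB, mixed ASCII/Cyrillic
per_call = timeit.timeit(lambda: is_valid_utf8(buf), number=100) / 100
print(f"{per_call * 1000:.2f} ms per scan")
```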
Re: Detection of unlabeled UTF-8
On 8/30/2013 1:41 PM, Anne van Kesteren wrote:
> On Fri, Aug 30, 2013 at 7:33 PM, Joshua Cranmer 🐧 wrote:
>> The problem I have with this approach is that it assumes that the page is authored by someone who definitively knows the charset, which is not a scenario which universally holds. Suppose you have a page that serves up the contents of a plain text file, so your source data has no indication of its charset. What charset should the page report?
> Where did the text file come from?

The example I have in mind is something like MXR. The text file is some "external" source (say, a file in some source repository).

> There's a source somewhere... And these days that's hardly how people create content anyway.

I would guess that most content these days does not consist of static pages but rather dynamically-generated content that is amalgamated from several databases of various kinds. These sources don't necessarily annotate their text with their charset (indeed, the entire problem we're discussing is due to people not annotating text with its charset). I know of at least one blog where the comments (and only the comments) get mojibake'd (UTF-8 -> ISO-8859-1 -> UTF-8), and I recall in the past seeing an RSS feed that got double-mojibake'd (UTF-8 -> ISO-8859-1 -> UTF-8 -> ISO-8859-1 -> UTF-8). Those examples aren't something the browser can fix, but they should make clear that authors have much less control (and/or knowledge) over the source charsets of their data than you would expect.

> And again, it has already been pointed out we cannot scan the entire byte stream (since text/plain uses the HTML parser it goes for that too, unless we make an exception I suppose, but what data supports that?), which would make the situation worse.

I don't think there *is* a sane approach that satisfies everybody. Either you break "UTF8-just-works-everywhere", you break legacy content, or you make parsing take inordinate time... or you might find a happy medium if you're willing to make document.charset lie. :-)

-- Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
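The round trip Joshua describes is easy to reproduce; a small Python demonstration, decoding with cp1252 since that is how browsers treat content labeled ISO-8859-1:

```python
# Each wrong decode multiplies every non-ASCII character into more bytes.
s = "naïve café"
once = s.encode("utf-8").decode("cp1252")
print(once)   # naÃ¯ve cafÃ©      (mojibake'd once)
twice = once.encode("utf-8").decode("cp1252")
print(twice)  # naÃƒÂ¯ve cafÃƒÂ©  (double-mojibake'd)
```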
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 8:36 PM, Adam Roach wrote:
> On 8/30/13 13:41, Anne van Kesteren wrote:
>> Where did the text file come from? There's a source somewhere... And these days that's hardly how people create content anyway.
> Maybe not for the content _you_ consume, but the Internet is a bit larger than our ivory tower.

I was talking about content creation. As for consumption, I'd love to see data that shows that unlabeled UTF-8 content is common.

-- http://annevankesteren.nl/
Re: Detection of unlabeled UTF-8
Mike Hoye wrote:
> On 2013-08-30 3:17 PM, Adam Roach wrote:
>> On 8/30/13 14:11, Adam Roach wrote:
>>> ...helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ??? ?? "??" ? ?".
>> Well, *there's* a heavy dose of irony in the context of this thread. I wonder what rules our mailing list server applies for character set decimation. When I sent that out, the question marks were a perfectly readable string of Cyrillic characters. Which provides a strong object lesson in the fact that character set configuration is hard. If we can't get this right internally, I think we've lost the moral ground in saying that others should be able to, and tough luck if they can't.
> For what it's worth, the original came through Thunderbird as a perfectly legitimate string of Russian at my end: ??? ?? ? , ??? ?? ??? , ??? ?? ?? ? ??. ??? ?? ???, ??? ?? ??? ?? , ?? ??? ?? ? ??? ? ???.

I just see question marks here, but then again the headers in both messages declare a character set of ISO-8859-1. As for the original message, it seems to have been corrupted; for instance, € characters have been turned into EUR. Maybe it got "converted" from Windows-1252 (which has the € character) into ISO-8859-1 (which does not)?

(I remembered at the last minute to change my character coding to something other than ISO-8859-1, so hopefully those euro signs pass through intact.)

-- Warning: May contain traces of nuts.
Re: Detection of unlabeled UTF-8
On 8/30/13 13:41, Anne van Kesteren wrote:
> Where did the text file come from? There's a source somewhere... And these days that's hardly how people create content anyway.

Maybe not for the content _you_ consume, but the Internet is a bit larger than our ivory tower. Check out, for example: https://www.rfc-editor.org/rse/wiki/lib/exe/fetch.php?media=design:future-unpag-20130820.txt

In particular, when you look at that document, tell me what you think the parenthetical phrase after the author's name is supposed to look like -- because I can guarantee that Firefox isn't doing the right thing here.

> And again, it has already been pointed out we cannot scan the entire byte stream

Sure we can. We just can't fix things on the fly: we'd need something akin to a user prompt and probably a page reload. Which is what I'm proposing.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On 2013-08-30 3:17 PM, Adam Roach wrote:
> On 8/30/13 14:11, Adam Roach wrote:
>> ...helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ??? ?? "??" ? ?".
> Well, *there's* a heavy dose of irony in the context of this thread. I wonder what rules our mailing list server applies for character set decimation. When I sent that out, the question marks were a perfectly readable string of Cyrillic characters. Which provides a strong object lesson in the fact that character set configuration is hard. If we can't get this right internally, I think we've lost the moral ground in saying that others should be able to, and tough luck if they can't.

For what it's worth, the original came through Thunderbird as a perfectly legitimate string of Russian at my end:

??? ?? ? , ??? ?? ??? , ??? ?? ?? ? ??. ??? ?? ???, ??? ?? ??? ?? , ?? ??? ?? ? ??? ? ???.

- mhoye
Re: Detection of unlabeled UTF-8
On 8/30/13 12:24, Mike Hoye wrote:
> On 2013-08-30 11:17 AM, Adam Roach wrote:
>> It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge.
> For what it's worth Internet Explorer handled this (before UTF-8 and caring about JS performance were a thing) by guessing what encoding to use, comparing a letter-frequency-analysis of a page's content to a table of which bytes are most common in which encodings of which languages. ... From both the developer and user perspectives, it amounted to "something went wrong because of bad magic."

I'd like to clarify two points about what I'm proposing.

First, I'm not proposing that we do anything without explicit user intervention, other than present an unobtrusive bar helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ??? ?? "??" ? ?". (No political statement intended here -- that's just the leading headline on Pravda at the moment.) If the user is happy with the encoding, they do nothing and go about their business. If the user determines that the rendering is, in fact, not what they want, they can simply click on the "Yes" button and (with high probability) everything is right with the world again.

Also note that I'm not proposing that we try to do generic character set and language detection. That's fraught with the perils you cite. The topic we're discussing here is UTF-8, which can be easily detected with extremely high confidence.

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On 8/30/13 14:11, Adam Roach wrote:
> ...helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ??? ?? "??" ? ?".

Well, *there's* a heavy dose of irony in the context of this thread. I wonder what rules our mailing list server applies for character set decimation. When I sent that out, the question marks were a perfectly readable string of Cyrillic characters.

Which provides a strong object lesson in the fact that character set configuration is hard. If we can't get this right internally, I think we've lost the moral ground in saying that others should be able to, and tough luck if they can't.

/a
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 7:33 PM, Joshua Cranmer 🐧 wrote:
> The problem I have with this approach is that it assumes that the page is authored by someone who definitively knows the charset, which is not a scenario which universally holds. Suppose you have a page that serves up the contents of a plain text file, so your source data has no indication of its charset. What charset should the page report? The choice is between guessing (presumably UTF-8) or saying nothing (which causes the browser to guess Windows-1252, generally).

Where did the text file come from? There's a source somewhere... And these days that's hardly how people create content anyway.

And again, it has already been pointed out we cannot scan the entire byte stream (since text/plain uses the HTML parser it goes for that too, unless we make an exception I suppose, but what data supports that?), which would make the situation worse.

-- http://annevankesteren.nl/
Re: Detection of unlabeled UTF-8
On 8/30/2013 4:01 AM, Anne van Kesteren wrote:
> On Fri, Aug 30, 2013 at 9:40 AM, Gervase Markham wrote:
>> We don't want people to try and move to UTF-8, but move back because they haven't figured out how (or are technically unable) to label it correctly and "it comes out all wrong".
> You also don't want it to be wrong half of the time. Given that full content scans won't fly (we try to restrict scanning for encodings as much as possible), that's a very real possibility, especially given forums such as the one in the OP that are mostly ASCII. Labeling is what people ought to do, and it's very easy: <meta charset=utf-8> (if all other files end up unlabeled, they'll inherit from this one).

The problem I have with this approach is that it assumes that the page is authored by someone who definitively knows the charset, which is not a scenario which universally holds. Suppose you have a page that serves up the contents of a plain text file, so your source data has no indication of its charset. What charset should the page report? The choice is between guessing (presumably UTF-8) or saying nothing (which causes the browser to guess Windows-1252, generally).

-- Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist
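One way a server in Joshua's scenario could hedge its guess, sketched in Python with an invented function name; validity as UTF-8 is a strong positive signal, while failure to validate says only "some legacy encoding":

```python
def charset_for(data: bytes) -> str | None:
    # If the bytes happen to be valid UTF-8 (which includes pure ASCII),
    # labeling them UTF-8 is very unlikely to be wrong. Otherwise say
    # nothing and let the browser apply its usual fallback (generally
    # Windows-1252 in Western locales).
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None
```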
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 6:31 PM, Chris Peterson wrote:
> Is there a less error-prone default we can recommend to Linux distribution packagers? Maybe we can squelch the problem upstream instead of adding browser hacks. The number of web server and distro packagers we would need to reach out to is probably pretty small.

The least error-prone default is not having one at all: that way HTTP does not override the content, and labeling within the content works. Understanding of HTTP is severely limited, so HTTP being authoritative is kind of a problem here.

-- http://annevankesteren.nl/
Re: Detection of unlabeled UTF-8
On 8/30/13 3:03 AM, Henri Sivonen wrote:
> Telemetry data suggests that these days the more common reason for seeing mojibake is that there is an encoding declaration but it is wrong. My guess is that this arises from Linux distributions silently changing their Apache defaults to send a charset parameter in Content-Type on the theory that it's good for security to send one even if the person packaging Apache logically can have no clue of what the value of the parameter should be for a specific deployment. (I think we should not start second-guessing encoding declarations.)

Is there a less error-prone default we can recommend to Linux distribution packagers? Maybe we can squelch the problem upstream instead of adding browser hacks. The number of web server and distro packagers we would need to reach out to is probably pretty small.

chris
Re: Detection of unlabeled UTF-8
On 2013-08-30 11:17 AM, Adam Roach wrote:
> It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge.

For what it's worth, Internet Explorer handled this (before UTF-8 and caring about JS performance were a thing) by guessing what encoding to use, comparing a letter-frequency analysis of a page's content to a table of which bytes are most common in which encodings of which languages.

It's probably not a suitable approach in modernity, because of performance problems and horrible-though-rare edge cases. If whatever you'd written turned out to have an unusual letter frequency, or (worse) a comment added to your badly-written CMS tripped that switch, your previously-Korean page would suddenly and magically start rendering in Hebrew or something, and unless you knew something about character encoding in IE it was basically impossible to figure out what had gone wrong or why. From both the developer and user perspectives, it amounted to "something went wrong because of bad magic."

- mhoye
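A toy sketch of the kind of frequency scoring Mike describes; the table contents, scoring function, and names are invented for illustration (IE's actual tables and math were never public):

```python
from collections import Counter

# Hypothetical per-(encoding, language) tables mapping high bytes to an
# expected relative frequency in text of that language. Note the same
# byte (0xEE) appears in both tables -- overlap like this is exactly how
# a page with unusual letter frequencies could flip languages.
FREQ_TABLES: dict[str, dict[int, float]] = {
    "windows-1251/Russian": {0xEE: 0.11, 0xE0: 0.08, 0xE5: 0.08},  # о, а, е
    "windows-1255/Hebrew":  {0xE9: 0.11, 0xE5: 0.10, 0xEE: 0.08},  # י, ו, מ
}

def best_guess(data: bytes) -> str:
    high = Counter(b for b in data if b >= 0x80)
    total = sum(high.values()) or 1

    def score(table: dict[int, float]) -> float:
        # Overlap between observed and expected high-byte frequencies.
        return sum(min(high[b] / total, f) for b, f in table.items())

    return max(FREQ_TABLES, key=lambda name: score(FREQ_TABLES[name]))
```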
Re: Detection of unlabeled UTF-8
On 8/30/13 05:08, Nicholas Nethercote wrote:
> On Fri, Aug 30, 2013 at 8:03 PM, Henri Sivonen wrote:
>> I think we should encourage Web authors to use UTF-8 *and* to *declare* it.
> I'm no expert on this stuff, but Henri's point sure sounds sensible to me.

It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge. Both of these are valid concerns, and I'm afraid that we're not assigning enough weight to the user perspective.

I think we can find some middle ground here, where we help developers discover their misconfiguration, while also handing users the tool they need to fix it. Maybe an unobtrusive bar (similar to the password save bar) that says something like: "This page's character encoding appears to be mislabeled, which might cause certain characters to display incorrectly. Would you like to reload this page as Unicode? [Yes] [No] [More Information] [x]".

-- Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 4:31 PM, Aryeh Gregor wrote:
> In particular, you need to decide on the encoding before you start running any user script, because you don't want document.characterSet etc. to change once it might have already been accessed. For performance reasons, we want to be able to run scripts immediately after receiving the initial TCP response, if there are any to run yet. This implies we need to decide on the character set after reading the first segment, which typically will not contain the actual content of the page that we would want to sniff on pages like http://www.eyrie-productions.com/. Right?

Right.

> (I say this only because my initial reaction was that we could hold off on deciding what encoding to use until we find the first non-ASCII byte without any ill effects, if we really wanted to. That would probably make the site in question work. But then I realized it would break document.characterSet, so it's not an option even if we wanted more sniffing.)

Right. The idea occurred to me, too, and then I thought of scripts and styles.

-- Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 1:03 PM, Henri Sivonen wrote:
> This is true if you run the heuristic over the entire byte stream. Unfortunately, since we support incremental loading of HTML (and will have to continue to do so), we don't have the entire byte stream available at the time when we need to make a decision of what encoding to assume.

In particular, you need to decide on the encoding before you start running any user script, because you don't want document.characterSet etc. to change once it might have already been accessed. For performance reasons, we want to be able to run scripts immediately after receiving the initial TCP response, if there are any to run yet. This implies we need to decide on the character set after reading the first segment, which typically will not contain the actual content of the page that we would want to sniff on pages like http://www.eyrie-productions.com/. Right?

(I say this only because my initial reaction was that we could hold off on deciding what encoding to use until we find the first non-ASCII byte without any ill effects, if we really wanted to. That would probably make the site in question work. But then I realized it would break document.characterSet, so it's not an option even if we wanted more sniffing.)
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 8:03 PM, Henri Sivonen wrote:
> I think we should encourage Web authors to use UTF-8 *and* to *declare* it.

I'm no expert on this stuff, but Henri's point sure sounds sensible to me.

Nick
Re: Detection of unlabeled UTF-8
On Thu, Aug 29, 2013 at 9:41 PM, Zack Weinberg wrote:
> All the discussion of fallback character encodings has reminded me of an issue I've been meaning to bring up for some time: As a user of the en-US localization, nowadays the overwhelmingly most common situation where I see mojibake is when a site puts UTF-8 in its pages without declaring any encoding at all (neither via <meta> nor Content-Type).

Telemetry data suggests that these days the more common reason for seeing mojibake is that there is an encoding declaration but it is wrong. My guess is that this arises from Linux distributions silently changing their Apache defaults to send a charset parameter in Content-Type on the theory that it's good for security to send one even if the person packaging Apache logically can have no clue of what the value of the parameter should be for a specific deployment. (I think we should not start second-guessing encoding declarations.)

> It is possible to distinguish UTF-8 from most legacy encodings heuristically with high reliability, and I'd like to suggest that we ought to do so, independent of locale.

This is true if you run the heuristic over the entire byte stream. Unfortunately, since we support incremental loading of HTML (and will have to continue to do so), we don't have the entire byte stream available at the time when we need to make a decision of what encoding to assume.

> Having read through a bunch of the "fallback encoding is wrong" bugs Henri's been filing, I have the impression that Henri would prefer we *not* detect UTF-8

Correct. Every time a localization sets the fallback to UTF-8 or a heuristic detector detects unlabeled UTF-8 is an opportunity for Web authors to generate a new legacy of unlabeled UTF-8 content thinking that everything is okay.

> 1. There exist sites that still regularly add new, UTF-8-encoded content, but whose *structure* was laid down in the late 1990s or early 2000s, declares no encoding, and is unlikely ever to be updated again. The example I have to hand is http://www.eyrie-productions.com/Forum/dcboard.cgi?az=read_count&om=138&forum=DCForumID24&viewmode=threaded ; many other posts on this forum have the same problem. Take note of the vintage HTML. I suggested to the admins of this site that they add <meta charset="utf-8"> to the master page template, and was told that no one involved in current day-to-day operations has the necessary access privileges. I suspect that this kind of situation is rather more common than we would like to believe.

It's easy to have an anecdotal single data point of something on the Web being broken. Is there any data on how common this problem is relative to other legacy encoding phenomena?

> 2. For some of the fallback-encoding-is-wrong bugs still open, a binary UTF-8/unibyte heuristic would save the localization from having to choose between displaying legacy minority-language content correctly and displaying legacy hegemonic-language content correctly. If I understand correctly, this is the case at least for Welsh: https://bugzilla.mozilla.org/show_bug.cgi?id=844087 .

If we hadn't been defaulting to UTF-8 in any localization at any point, the minority-language unlabeled UTF-8 legacy would not have had a chance to develop. It's terrible that after having made the initial mistake of letting an unlabeled non-UTF-8 legacy develop, the mistake has been repeated for some localizations by allowing a legacy of unlabeled UTF-8 to develop. We might still have a chance of stopping the new legacy of unlabeled UTF-8 from developing.

> 3. Files loaded from local disk have no encoding metadata from the transport, and may have no in-band label either; in particular, UTF-8 plain text with no byte order mark, which is increasingly common, should not be misidentified as the legacy encoding.

When accessing the local disk, it might indeed make sense to examine all the bytes of the file before starting parsing.

> Having a binary UTF-8/unibyte heuristic might address some of the concerns mentioned in the "File API should not use 'universal' character detection" bug, https://bugzilla.mozilla.org/show_bug.cgi?id=848842 .

I think in the case of the File API, we should just implement what the spec says and assume UTF-8. I think it's reprehensible that we have pulled non-spec magic out of thin air here.

> If people are concerned about "infecting" the modern platform with heuristics, perhaps we could limit application of the heuristic to quirks mode, for HTML delivered over HTTP.

I'm not particularly happy about the prospect of having to change the order of the quirkiness determination and the encoding determination.

On Fri, Aug 30, 2013 at 11:40 AM, Gervase Markham wrote:
> That seems wise to me, on gut instinct.

It looks to me that it was gut instinct that led to stuff like the Esperanto locale setting the fallback to UTF-8, thereby making the locale top the list of character encoding overwr
Re: Detection of unlabeled UTF-8
On Fri, Aug 30, 2013 at 9:40 AM, Gervase Markham wrote:
> We don't want people to try and move to UTF-8, but move back because they haven't figured out how (or are technically unable) to label it correctly and "it comes out all wrong".

You also don't want it to be wrong half of the time. Given that full content scans won't fly (we try to restrict scanning for encodings as much as possible), that's a very real possibility, especially given forums such as the one in the OP that are mostly ASCII. Labeling is what people ought to do, and it's very easy: <meta charset=utf-8> (if all other files end up unlabeled, they'll inherit from this one).

-- http://annevankesteren.nl/
Re: Detection of unlabeled UTF-8
On 29/08/13 19:41, Zack Weinberg wrote:
> All the discussion of fallback character encodings has reminded me of an issue I've been meaning to bring up for some time: As a user of the en-US localization, nowadays the overwhelmingly most common situation where I see mojibake is when a site puts UTF-8 in its pages without declaring any encoding at all (neither via <meta> nor Content-Type). It is possible to distinguish UTF-8 from most legacy encodings heuristically with high reliability, and I'd like to suggest that we ought to do so, independent of locale.

That seems wise to me, on gut instinct. If the web is moving to UTF-8, and we are trying to encourage that, then it seems we should expect that this is what we get unless there are hints that we are wrong, whether that's the TLD, the statistical profile of the characters, or something else.

We don't want people to try and move to UTF-8, but move back because they haven't figured out how (or are technically unable) to label it correctly and "it comes out all wrong".

Gerv
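The core of the heuristic Zack describes reduces to a small decision procedure; a Python sketch, with the three-way split being one editor's framing. The structural check below skips the finer second-byte restrictions (overlong forms, surrogates) that a production validator per the Encoding Standard enforces:

```python
def classify(data: bytes) -> str:
    if all(b < 0x80 for b in data):
        return "ascii"   # renders identically under any ASCII-compatible encoding
    # Structural UTF-8 check: every lead byte must be followed by exactly
    # the right number of 0b10xxxxxx continuation bytes. Legacy 8-bit text
    # breaks this rule almost immediately, which is why false positives
    # are so rare.
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            need = 0            # ASCII byte
        elif 0xC2 <= b <= 0xDF:
            need = 1            # two-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2            # three-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3            # four-byte sequence
        else:
            return "legacy"     # stray continuation or illegal lead byte
        for j in range(1, need + 1):
            if i + j >= n or not 0x80 <= data[i + j] <= 0xBF:
                return "legacy"
        i += need + 1
    return "utf-8"

print(classify("déjà vu".encode("utf-8")))   # utf-8
print(classify("déjà vu".encode("cp1252")))  # legacy: 0xE9 followed by ASCII
```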
Re: Detection of unlabeled UTF-8
On Thu, Aug 29, 2013 at 7:41 PM, Zack Weinberg wrote:
> If people are concerned about "infecting" the modern platform with heuristics, perhaps we could limit application of the heuristic to quirks mode, for HTML delivered over HTTP. I expect this would cover the majority of the sites described under point 1, and probably 2 as well.

We should not introduce new heuristics. We could maybe introduce a new algorithm to the platform, but only if there's buy-in across the board. Given how fast UTF-8 adoption is rising, though, I'm not sure it's worth the effort for the couple of sites that might be helped.

-- http://annevankesteren.nl/