On 8/30/2013 1:41 PM, Anne van Kesteren wrote:
> On Fri, Aug 30, 2013 at 7:33 PM, Joshua Cranmer 🐧 <pidgeo...@gmail.com> wrote:
>> The problem I have with this approach is that it assumes that the page is
>> authored by someone who definitively knows the charset, which is not a
>> scenario which universally holds. Suppose you have a page that serves up the
>> contents of a plain text file, so your source data has no indication of its
>> charset. What charset should the page report? The choice is between guessing
>> (presumably UTF-8) or saying nothing (which causes the browser to guess
>> Windows-1252, generally).
> Where did the text file come from?
The example I have in mind is something like MXR. The text file is some "external" source (say, a file in some source repository).
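To make the "guess UTF-8 or fall back to Windows-1252" choice concrete, here is a minimal sketch of what a server-side tool could do with a file of unknown charset. The function name `guess_decode` is hypothetical, and this is only an illustration of the trade-off, not what MXR or any browser actually does.

```python
def guess_decode(data):
    """Return (text, charset_used) for bytes whose charset is unknown.

    Hypothetical helper: try strict UTF-8 first, because non-ASCII text in a
    legacy single-byte charset is rarely valid UTF-8 by accident.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Python's windows-1252 codec leaves a few bytes undefined, so replace
        # those instead of raising a second time.
        return data.decode("windows-1252", errors="replace"), "windows-1252"
```

Note that pure-ASCII input decodes identically either way, so the guess only matters once a non-ASCII byte shows up.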
> There's a source somewhere... And
> these days that's hardly how people create content anyway.
I would guess that most content these days does not consist of static pages but rather dynamically-generated content that is amalgamated from several databases of various kinds. These sources don't necessarily annotate their text with their charset (indeed, the entire problem we're discussing is due to people not annotating text with its charset). I know of at least one blog where the comments (and only the comments) get mojibake'd (UTF-8->ISO-8859-1->UTF-8), and I recall in the past seeing an RSS feed that got double-mojibake'd (UTF-8->ISO-8859-1->UTF-8->ISO-8859-1->UTF-8). Those examples aren't something the browser can fix, but they should make clear that authors have much less control (and/or knowledge) over the source charsets of their data than you would expect.
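For anyone who has not watched this happen, the UTF-8->ISO-8859-1->UTF-8 chain can be reproduced in a few lines. This is a sketch only: the helper name `mojibake` is made up for illustration, and it models one plausible way the corruption arises (UTF-8 bytes misread as ISO-8859-1, with the result later re-encoded as UTF-8).

```python
def mojibake(text, rounds=1):
    """Simulate UTF-8 bytes being misread as ISO-8859-1, `rounds` times."""
    for _ in range(rounds):
        # The bytes are UTF-8, but some layer decodes them as ISO-8859-1.
        text = text.encode("utf-8").decode("iso-8859-1")
    return text


original = "é"
once = mojibake(original)       # 'Ã©': one character has become two
twice = mojibake(original, 2)   # four characters, two of them C1 controls

# One round is reversible only if you know exactly which misreading happened.
repaired = once.encode("iso-8859-1").decode("utf-8")
```

Each round doubles the length of a non-ASCII character, and nothing in the resulting bytes says how many rounds occurred, which is why the receiving end cannot reliably undo it.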
> And again,
> it has already been pointed out we cannot scan the entire byte stream
> (since text/plain uses the HTML parser it goes for that too, unless we
> make an exception I suppose, but what data supports that?), which
> would make the situation worse.
I don't think there *is* a sane approach that satisfies everybody. Either you break "UTF8-just-works-everywhere", you break legacy content, you make parsing take an inordinate amount of time... or you might find a happy medium if you're willing to make document.charset lie. :-)
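The cost of not scanning the whole stream can be shown with a rough sketch. Everything here is an assumption for illustration: the function name `sniff_prefix`, the 1024-byte limit, and the Windows-1252 fallback are not taken from any browser's actual detection code.

```python
import codecs


def sniff_prefix(data, limit=1024):
    """Guess a charset from only the first `limit` bytes (hypothetical)."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte sequence cut off at the prefix
        # boundary, but still raises on bytes that are invalid UTF-8.
        decoder.decode(data[:limit], final=False)
    except UnicodeDecodeError:
        return "windows-1252"
    return "utf-8"


# A legacy document whose first non-ASCII byte appears after the prefix.
late = b"a" * 2000 + b"\xe9"
guess = sniff_prefix(late)  # the ASCII-only prefix looks like valid UTF-8
```

The prefix-only guess is cheap, but for `late` it commits to UTF-8 even though the full stream is not valid UTF-8, which is the "break legacy content" side of the trade-off.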

--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
