Re: Detection of unlabeled UTF-8

Joshua Cranmer 🐧 Mon, 02 Sep 2013 11:41:05 -0700

On 8/30/2013 1:41 PM, Anne van Kesteren wrote:

On Fri, Aug 30, 2013 at 7:33 PM, Joshua Cranmer 🐧 <pidgeo...@gmail.com> wrote:

The problem I have with this approach is that it assumes that the page is
authored by someone who definitively knows the charset, which is not a
scenario which universally holds. Suppose you have a page that serves up the
contents of a plain text file, so your source data has no indication of its
charset. What charset should the page report? The choice is between guessing
(presumably UTF-8) or saying nothing (which causes the browser to guess
Windows-1252, generally).

Where did the text file come from?

The example I have in mind is something like MXR. The text file is some
"external" source (say, a file in some source repository).

There's a source somewhere... And
these days that's hardly how people create content anyway.

I would guess that most content these days does not consist of static
pages but rather dynamically-generated content that is amalgamated from
several databases of various kinds. These sources don't necessarily
annotate their text with their charset (indeed, the entire problem we're
discussing is due to people not annotating text with its charset). I
know of at least one blog where the comments (and only the comments) get
mojibake'd (UTF-8->ISO-8859-1->UTF-8), and I recall in the past seeing
an RSS feed that got double-mojibake'd
(UTF-8->ISO-8859-1->UTF-8->ISO-8859-1->UTF-8). Those examples aren't
something the browser can fix, but it should make clear that authors
have much less control (and/or knowledge) over the source charsets of
their data than you would expect.
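
To make the mojibake chains above concrete, here is a minimal sketch in
Python. The helper names (mojibake, undo_mojibake) and the sample word are
made up for illustration; the only assumption about the real-world failure
is that some layer decodes UTF-8 bytes as ISO-8859-1 and then re-encodes
the result as UTF-8.

```python
def mojibake(text: str, rounds: int = 1) -> str:
    """Simulate UTF-8 bytes being misread as ISO-8859-1, `rounds` times.

    Hypothetical helper: each round is one UTF-8->ISO-8859-1 step; the
    returned string is what ends up stored (as UTF-8) afterwards.
    """
    for _ in range(rounds):
        text = text.encode("utf-8").decode("iso-8859-1")
    return text


def undo_mojibake(text: str, rounds: int = 1) -> str:
    """Reverse `rounds` of the damage. Only works if the number of rounds
    is known and no bytes were dropped or replaced along the way."""
    for _ in range(rounds):
        text = text.encode("iso-8859-1").decode("utf-8")
    return text


original = "arch\u00e6ologist"          # contains one non-ASCII character
once = mojibake(original)               # UTF-8->ISO-8859-1->UTF-8
twice = mojibake(original, rounds=2)    # the "double" chain

print(repr(once))
print(repr(twice))
print(undo_mojibake(twice, rounds=2) == original)
```

Each round turns every non-ASCII character into two, so the damage grows
with every hop, and undoing it requires knowing how many hops there were.
That information lives with whoever assembled the data, not the browser.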

And again,
it has already been pointed out we cannot scan the entire byte stream
(since text/plain uses the HTML parser it goes for that too, unless we
make an exception I suppose, but what data supports that?), which
would make the situation worse.
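
As a rough illustration of why scanning only part of the byte stream is a
problem, here is a sketch in Python. Everything in it is an assumption for
the sake of the example: the function names, the 1024-byte limit, and the
windows-1252 fallback are hypothetical and do not describe what any
browser actually does.

```python
import codecs


def prefix_is_valid_utf8(data: bytes, limit: int = 1024) -> bool:
    """Check whether the first `limit` bytes are well-formed UTF-8.

    An incremental decoder is used so that a multi-byte sequence cut in
    half by the limit is not counted as an error.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    whole_input = limit >= len(data)    # nothing left unseen after the prefix
    try:
        decoder.decode(data[:limit], final=whole_input)
    except UnicodeDecodeError:
        return False
    return True


def guess_charset(data: bytes, limit: int = 1024) -> str:
    """Hypothetical sniffer: UTF-8 if the prefix validates, else the
    legacy fallback mentioned in the thread."""
    return "utf-8" if prefix_is_valid_utf8(data, limit) else "windows-1252"


# A long ASCII-only head (think license header), then one legacy byte.
ascii_head = b"x" * 2000
legacy_doc = ascii_head + "caf\u00e9 au lait".encode("windows-1252")
utf8_doc = ascii_head + "caf\u00e9 au lait".encode("utf-8")

print(guess_charset(legacy_doc))                         # prefix only
print(guess_charset(legacy_doc, limit=len(legacy_doc)))  # full scan
print(guess_charset(utf8_doc, limit=len(utf8_doc)))      # full scan
```

With the prefix-only scan, the legacy document is guessed as UTF-8 because
the sniffer never reaches the byte that would have disproved it; only the
full scan gets it right, and a full scan is exactly what cannot be done
while parsing incrementally.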

I don't think there *is* a sane approach that satisfies everybody.
Either you break "UTF8-just-works-everywhere", you break legacy content,
you make parsing take inordinate times... or you might find a happy
medium if you're willing to make document.charset lie. :-)


--
Joshua Cranmer
Thunderbird and DXR developer
Source code archæologist

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
