Re: Detection of unlabeled UTF-8

Gervase Markham Fri, 30 Aug 2013 01:41:09 -0700

On 29/08/13 19:41, Zack Weinberg wrote:
> All the discussion of fallback character encodings has reminded me of an
> issue I've been meaning to bring up for some time: As a user of the
> en-US localization, nowadays the overwhelmingly most common situation
> where I see mojibake is when a site puts UTF-8 in its pages without
> declaring any encoding at all (neither via <meta charset> nor
> Content-Type). It is possible to distinguish UTF-8 from most legacy
> encodings heuristically with high reliability, and I'd like to suggest
> that we ought to do so, independent of locale.


That seems wise to me, on gut instinct. If the web is moving to UTF-8,
and we are trying to encourage that, then it seems we should expect that
this is what we get unless there are hints that we are wrong, whether
that's the TLD, the statistical profile of the characters, or something
else.

We don't want people to try and move to UTF-8, but move back because
they haven't figured out how (or are technically unable) to label it
correctly and "it comes out all wrong".

Gerv


_______________________________________________
dev-platform mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-platform

Re: Detection of unlabeled UTF-8

Reply via email to