Intent to implement and ship: UTF-8 autodetection for HTML and plain text loaded from file: URLs

Henri Sivonen Mon, 10 Dec 2018 04:28:17 -0800

(Note: This isn't really a Web-exposed feature, but this is a Web
developer-exposed feature.)


# Summary

Autodetect UTF-8 when loading HTML or plain text from file: URLs (only!).

Some Web developers like to develop locally from file: URLs (as
opposed to a local HTTP server) and then deploy using a Web server
that declares charset=UTF-8. To get the same convenience as when
developing with Chrome, they want the files loaded from file: URLs to
be treated as UTF-8 even though the HTTP header isn't there.

Non-developer users save files from the Web verbatim without the HTTP
headers and open the files from file: URLs. These days, those files
are most often in UTF-8 and lack the BOM, and sometimes they lack
<meta charset=utf-8>, and plain text files can't even use <meta
charset=utf-8>. These users, too, would like a Chrome-like convenience
when opening these files from file: URLs in Firefox.

# Details

If an HTML or plain text file loaded from a file: URL does not contain
a UTF-8 error in the first 50 MB, assume it is UTF-8. (It is extremely
improbable for text intended to be in a non-UTF-8 encoding to look
like valid UTF-8 on the byte level.) Otherwise, behave as at present:
assume the fallback legacy encoding, whose default depends on the
Firefox UI locale.
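
As a rough illustration of the rule above (not the Gecko code), the
check can be sketched in Python. The function name
guess_file_encoding, the limit parameter and the windows-1252 default
are assumptions made for this example; the real fallback depends on
the UI locale.

```python
import codecs

LIMIT = 50 * 1024 * 1024  # the 50 MB limit described in the text


def guess_file_encoding(data: bytes,
                        fallback: str = "windows-1252",  # assumed default
                        limit: int = LIMIT) -> str:
    """Return "utf-8" if the first `limit` bytes contain no UTF-8 error,
    otherwise return the legacy `fallback` encoding."""
    prefix = data[:limit]
    truncated = len(data) > limit
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # If the input was cut at the limit, a multi-byte sequence split
        # by the cut is not an error, so the decoder is not finalized.
        decoder.decode(prefix, final=not truncated)
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

For example, guess_file_encoding("päivä".encode("windows-1252"))
falls back to the legacy encoding, because the byte 0xE4 followed by
an ASCII letter is not a valid UTF-8 sequence.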

The 50 MB limit exists to avoid buffering everything when loading a
log file whose size is on the order of a gigabyte. 50 MB is an
arbitrary size that is significantly larger than "normal" HTML or text
files, so that "normal"-sized files are examined with 100% confidence
(i.e. the whole file is examined) but can be assumed to fit in RAM
even on computers that only have a couple of gigabytes of RAM.

The limit, despite being arbitrary, is checked exactly to avoid
visible behavior changes depending on how Necko chooses buffer
boundaries.
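
To show what checking the limit exactly buys, here is a minimal
Python sketch that counts bytes across chunks so that the verdict does
not depend on how the input is split. The names is_utf8_within_limit
and split are hypothetical, and the chunk lists merely stand in for
whatever buffers the networking layer delivers.

```python
import codecs


def is_utf8_within_limit(chunks, limit):
    """Examine at most `limit` bytes in total, however they are chunked."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    remaining = limit
    hit_limit = False
    try:
        for chunk in chunks:
            if len(chunk) > remaining:
                # Only the bytes up to the exact limit are examined.
                decoder.decode(chunk[:remaining], final=False)
                hit_limit = True
                break
            decoder.decode(chunk, final=False)
            remaining -= len(chunk)
        if not hit_limit:
            # The stream ended within the limit, so a dangling partial
            # sequence at the end counts as an error.
            decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


def split(data, size):
    """Cut `data` into chunks of `size` bytes (test helper)."""
    return [data[i:i + size] for i in range(0, len(data), size)]
```

Feeding the same bytes through split(data, 1), split(data, 7) or
split(data, 4096) yields the same answer for a given limit, which is
the property the paragraph above asks for.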

The limit is a number of bytes instead of a timeout in order to avoid
reintroducing timing dependencies (carefully removed in Firefox 4) to
HTML parsing--even for file: URLs.

Unless a <meta> declaring the encoding (or a BOM) is found within the
first 1024 bytes, up to 50 MB of input is buffered before tokenization
starts. That is, the feature assumes that local files don't need
incremental HTML parsing, that local file streams don't stall as part
of their intended operation, and that the content of local files is
available in its entirety (approximately) immediately.

There are counterexamples like Unix FIFOs (can be infinite and can
stall for an arbitrary amount of time) or file server shares mounted
as if they were local disks (data available somewhat less
immediately). It is assumed that it's OK to require people who have
built workflows around Unix FIFOs to use <meta charset> and that it's
OK to potentially start rendering a little later when file: URLs
actually cause network access.

UTF-8 autodetection is given lower precedence than all other signals
that are presently considered for file: URLs. In particular, if a
file:-URL HTML document frames another file: URL HTML document (i.e.
they count as same-origin), the child inherits the encoding from the
parent instead of UTF-8 autodetection getting applied in the child
frame.

# Why file: URLs only

The reason why the feature does not apply to http: or https: resources
is that in those cases, it really isn't OK to assume that all bytes
arrive so quickly as to not benefit from incremental rendering and it
isn't OK to assume that the stream doesn't intentionally stall.

Applying detection to http: or https: resources would mean at least
one of the following compromises:

* Making the detection unreliable by making it depend on non-ASCII
appearing in the first 1024 bytes (the number of bytes currently
buffered for scanning <meta>). If the <title> was always near the
start of the file and the natural language used a non-Latin script to
make non-ASCII in the <title> a certainty, this solution would be
reliable. However, this solution would be particularly bad for
Latin-script languages with infrequent non-ASCII, such as Finnish or
German, which can legitimately have all-ASCII titles despite the
language as a whole including non-ASCII. That is, if a developer
tested a site with a title that has some non-ASCII, things would
appear to work, but then the site would break when an all-ASCII title
occurs.

* Making results depend on timing. (Having a detection timeout would
make the results depend on network performance relative to wall-clock
time.)

* Making the detection unreliable by examining only the first buffer
passed by the networking subsystem to the HTML parser. This makes the
result dependent on network buffer boundaries (*and* potentially
timing to the extent timing affects the boundaries), which is
unreliable. Prior to Firefox 4, HTML parsing in Firefox depended on
network buffer boundaries, which was bad and was remedied in Firefox
4. According to
https://github.com/whatwg/encoding/issues/68#issuecomment-272993181 ,
Chrome chooses this mode of badness.

* Breaking incremental rendering. (Not acceptable for remote content
for user-perceived performance reasons.) This is what the solution for
file: URLs does on the assumption that it's OK, because the data in
its entirety is (approximately) immediately available.

* Causing reloads. This is the mode of badness that applies when our
Japanese detector is in use and the first 1024 bytes aren't enough to
make the decision.

All of these are bad. It's better to make the failure to declare UTF-8
in the http/https case something that the Web developer obviously has
to fix (by adding <meta>, HTTP header or the BOM) than to make it
appear that things work when actually at least one of the above forms
of badness applies.
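
The first compromise in the list above can be illustrated with a
small Python sketch. The page content is made up for the example (a
Finnish page with an all-ASCII title and windows-1252 body text), and
looks_like_utf8 is a hypothetical helper, not a browser API.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if `data` decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


SNIFF = 1024  # bytes currently buffered for scanning <meta>

# An all-ASCII title and more than 1024 bytes of ASCII markup, followed
# by legacy-encoded text containing non-ASCII letters.
head = b"<!DOCTYPE html><title>Tervetuloa</title>"
padding = b"<p>" + b"lorem ipsum " * 100 + b"</p>"
body = "<p>Hyvää päivää</p>".encode("windows-1252")
page = head + padding + body

prefix_verdict = looks_like_utf8(page[:SNIFF])  # only sees ASCII
full_verdict = looks_like_utf8(page)            # sees the legacy bytes
```

Here the 1024-byte prefix looks like UTF-8 while the whole page does
not, so a prefix-only detector would mislabel the page.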

# Bug

https://bugzilla.mozilla.org/show_bug.cgi?id=1071816

# Link to standard

https://html.spec.whatwg.org/#determining-the-character-encoding step
7 is basically an "anything goes" step for legacy reasons--mainly to
allow Japanese encoding detection that IE, WebKit and Gecko had before
the spec was written. Chrome started detecting more without prior
standard-setting discussion. See
https://github.com/whatwg/encoding/issues/68 for after-the-fact
discussion.

# Platform coverage

All

# Estimated or target release

66

# Preference behind which this will be implemented

Not planning to have a pref for this.

# Is this feature enabled by default in sandboxed iframes?

This is implemented to apply to all non-resource:-URL-derived file:
URLs, but since same-origin inheritance to child frames takes
precedence, this isn't expected to apply to sandboxed iframes in
practice.

# DevTools bug

No new dev tools integration. The pre-existing console warning about
undeclared character encoding will still be shown in the autodetection
case.

# Do other browser engines implement this

Chrome does, but not with the same number of bytes examined.

Safari as of El Capitan (my Mac is stuck on El Capitan) doesn't.

Edge as of Windows 10 1803 doesn't.

# web-platform-tests

As far as I'm aware, WPT doesn't cover file: URL behavior, and there
isn't a proper spec for this. Hence, unit tests use mochitest-chrome.

# Is this feature restricted to secure contexts?

Restricted to file: URLs.

-- 
Henri Sivonen