On 8/6/19 8:16 AM, Anne van Kesteren wrote:
> On Sat, Jul 20, 2019 at 2:05 AM Jeff Walden <jwal...@mit.edu> wrote:
>> (*Only* valid UTF-8: any invalidity, including for WTF-8, is an immediate
>> error, no replacement-character semantics applied.)
>
> Wouldn't adding this allow you to largely bypass the text decoder if
> it identifies the content to be UTF-8? Meaning we'd have to scan and
> copy bytes even less?
In principle "yes"; in current reality "no". First and foremost, as our script-loading/networking code is set up now there's an inevitable copy. When we load an HTTP channel we process its data using a |NS_NewIncrementalStreamLoader| that ultimately invokes this signature: NS_IMETHODIMP ScriptLoadHandler::OnIncrementalData(nsIIncrementalStreamLoader* aLoader, nsISupports* aContext, uint32_t aDataLength, const uint8_t* aData, uint32_t* aConsumedLength) { This function is *provided* a prefilled (probably by NSPR, ultimately by network driver/card?) buffer, then we use an |intl::Decoder| to decode-by-copying that buffer's bytes into the script loader's JS allocator-allocated buffer, reallocating and expanding it as necessary. (The buffer must be JS-allocated because it's transferred to the JS engine for parsing/compilation/execution. If you don't transfer a JS-owned buffer, SpiderMonkey makes a fresh copy anyway.) To avoid a copy, you'd need to intersperse decoding (and buffer-expanding) code into the networking layer -- theoretically doable, practically tricky (especially if we assume the buffer is mindlessly filled by networking driver code). Second -- alternatively -- if the JS engine literally processed raw code units of utterly unknown validity and networking code could directly fill in the buffer the JS engine would process --the JS engine would require additional changes to handle invalid code units. UTF-16 already demands this because JS is natively WTF-16 (and all 16-bit sequences are WTF-16). But all UTF-8 processing code assumes validity now. Extra effort would be required to handle invalidity. We do a lot of keeping raw pointers/indexes into source units and then using them later -- think for things like atomizing identifiers -- and all those process-this-range-of-data operations would have to be modified. *Possible*? Yes. Tricky? For sure. (Particularly as many of those coordinates we end up baking into the final compiled script representation -- so invalidity wouldn't be a transient concern but one that would persistent indefinitely.) Third and maybe most important if the practical considerations didn't exist: like every sane person, I'm leery of adding yet another place where arbitrary ostensibly-UTF-8 bytes are decoded with any sort of invalid-is-not-immediate-error semantics. In a very distant past I fixed an Acid3 failure where we mis-implemented UTF-16 replacement-character semantics. https://bugzilla.mozilla.org/show_bug.cgi?id=421576 New implementations of replacement-character semantics are disproportionately risky. In fact when I fixed that I failed to fix a separate UTF-16 decoding implementation that had to be consistent with it, introducing a security bug. https://bugzilla.mozilla.org/show_bug.cgi?id=489041#c9 *Some* of that problem was because replacement-character stuff was at the time under-defined, and now it's well-defined...but still. Risk. Once burned, twice shy. In principle this is all solvable. But it's all rather complicated and fraught. We can probably get better and safer wins making improvements elsewhere. Jeff _______________________________________________ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform