On 8/6/19 8:16 AM, Anne van Kesteren wrote:
> On Sat, Jul 20, 2019 at 2:05 AM Jeff Walden <jwal...@mit.edu> wrote:
>> (*Only* valid UTF-8: any invalidity, including for WTF-8, is an immediate
>> error, no replacement-character semantics applied.)
>
> Wouldn't adding this allow you to largely bypass the text decoder if
> it identifies the content to be UTF-8? Meaning we'd have to scan and
> copy bytes even less?
In principle "yes"; in current reality "no". First and foremost, as our script-loading/networking code is set up now there's an inevitable copy. When we load an HTTP channel we process its data using a |NS_NewIncrementalStreamLoader| that ultimately invokes this signature: NS_IMETHODIMP ScriptLoadHandler::OnIncrementalData(nsIIncrementalStreamLoader* aLoader, nsISupports* aContext, uint32_t aDataLength, const uint8_t* aData, uint32_t* aConsumedLength) { This function is *provided* a prefilled (probably by NSPR, ultimately by network driver/card?) buffer, then we use an |intl::Decoder| to decode-by-copying that buffer's bytes into the script loader's JS allocator-allocated buffer, reallocating and expanding it as necessary. (The buffer must be JS-allocated because it's transferred to the JS engine for parsing/compilation/execution. If you don't transfer a JS-owned buffer, SpiderMonkey makes a fresh copy anyway.) To avoid a copy, you'd need to intersperse decoding (and buffer-expanding) code into the networking layer -- theoretically doable, practically tricky (especially if we assume the buffer is mindlessly filled by networking driver code). Second -- alternatively -- if the JS engine literally processed raw code units of utterly unknown validity and networking code could directly fill in the buffer the JS engine would process --the JS engine would require additional changes to handle invalid code units. UTF-16 already demands this because JS is natively WTF-16 (and all 16-bit sequences are WTF-16). But all UTF-8 processing code assumes validity now. Extra effort would be required to handle invalidity. We do a lot of keeping raw pointers/indexes into source units and then using them later -- think for things like atomizing identifiers -- and all those process-this-range-of-data operations would have to be modified. *Possible*? Yes. Tricky? For sure. (Particularly as many of those coordinates we end up baking into the final compiled script representation -- so invalidity wouldn't be a transient concern but one that would persistent indefinitely.) Third and maybe most important if the practical considerations didn't exist: like every sane person, I'm leery of adding yet another place where arbitrary ostensibly-UTF-8 bytes are decoded with any sort of invalid-is-not-immediate-error semantics. In a very distant past I fixed an Acid3 failure where we mis-implemented UTF-16 replacement-character semantics. https://bugzilla.mozilla.org/show_bug.cgi?id=421576 New implementations of replacement-character semantics are disproportionately risky. In fact when I fixed that I failed to fix a separate UTF-16 decoding implementation that had to be consistent with it, introducing a security bug. https://bugzilla.mozilla.org/show_bug.cgi?id=489041#c9 *Some* of that problem was because replacement-character stuff was at the time under-defined, and now it's well-defined...but still. Risk. Once burned, twice shy. In principle this is all solvable. But it's all rather complicated and fraught. We can probably get better and safer wins making improvements elsewhere. Jeff _______________________________________________ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform