Hi all,

In the distant past, SpiderMonkey APIs consumed source text as two-byte UCS-2 
or one-byte |const char*|.  Was one-byte text ASCII?  UTF-8?  EBCDIC?  
Something else?  Who could say; no one thought about text encodings then.  *By 
happenstance* one-byte JS text was Latin-1: a byte is a code point.  And so 
lots of people used Latin-1 for JS purely because SpiderMonkey's carelessness 
made it easy.

SpiderMonkey's UTF-8 source support is far better and clearer now.  Most 
single-byte source users use UTF-8.  So I'm changing the remaining Gecko 
Latin-1 users to UTF-8.  The following scripts/script loaders now use 
exclusively UTF-8:

* JS components/modules (bug 1492932)
* subscripts via mozIJSSubScriptLoader.loadSubScript{,WithOptions} (bug 1492937)
* mochitest-browser scripts, because they're subscripts (bug 1492937)
* SJS scripts executed by httpd.js, because they're subscripts (bug 1513152, 
bug 1492937) [0]

Also, proxy autoconfig (PAC) scripts may now be written in UTF-8 (bug 
1492938).  (For compatibility reasons, a PAC script that is not valid UTF-8 
is treated as Latin-1: its bytes are inflated to UTF-16 and that is compiled.)
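
To make the PAC fallback concrete, here is a rough sketch in plain JS.  It is 
not the Gecko code; decodePacSource is a hypothetical name, and using 
TextDecoder for the validity check is my assumption for illustration only.

```javascript
// Hypothetical helper, for illustration only (not the Gecko implementation).
// Takes the raw bytes of a script and returns the text that would be compiled.
function decodePacSource(bytes) {
  try {
    // fatal: true makes decode() throw a TypeError on invalid UTF-8.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (e) {
    // Latin-1 inflation: each byte becomes the UTF-16 code unit with the
    // same numeric value.
    let inflated = "";
    for (const b of bytes) {
      inflated += String.fromCharCode(b);
    }
    return inflated;
  }
}

// Valid UTF-8: bytes C3 A9 decode to the single code point U+00E9.
const validText = decodePacSource(new Uint8Array([0xC3, 0xA9]));

// Invalid UTF-8: C3 followed by 28 is not a well-formed sequence, so both
// bytes are inflated individually to U+00C3 and U+0028.
const fallbackText = decodePacSource(new Uint8Array([0xC3, 0x28]));
```

The whole script takes one path or the other in this sketch; there is no 
per-character mixing of the two interpretations.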

Every affected script in the tree used UTF-8, so this just makes reality match 
expectation.  But it sometimes changes behavior and may affect patch backports:

* You may use non-ASCII code points directly in scripts (even outside comments) 
without needing escape sequences.
* If you *intend* to construct a string of the constituent UTF-8 code units of 
a non-ASCII code point, you must use hexadecimal escapes: "\xF0\x9F\x92\xA9".
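
A small sketch of the difference, runnable in any UTF-8 script.  latin1Inflate 
is a hypothetical helper I'm using only to mimic how the old Latin-1 
interpretation would have read the same source bytes.

```javascript
// Written directly: under UTF-8 this is the single code point U+1F4A9,
// stored as a surrogate pair (two UTF-16 code units).
const direct = "💩";

// Written with hexadecimal escapes: four separate code units whose values
// are the UTF-8 encoding of U+1F4A9.
const units = "\xF0\x9F\x92\xA9";

// Hypothetical helper: treat every byte as one code point, as a Latin-1
// reading of the source would.
function latin1Inflate(bytes) {
  return String.fromCharCode(...bytes);
}

// The UTF-8 bytes of the directly written character.
const utf8Bytes = new TextEncoder().encode(direct);

console.log(direct.length);                        // 2
console.log(direct.codePointAt(0).toString(16));   // "1f4a9"
console.log(units.length);                         // 4
console.log(latin1Inflate(utf8Bytes) === units);   // true
```

The last line is the backport hazard: a script that relied on the Latin-1 
reading of a directly written character needs the escaped form to keep 
producing the same string.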

Another step toward fewer text encodings.  \o/

Jeff

[0] Note that until bug 1514075 lands, SJS scripts used in Android test runs 
will be interpreted as Latin-1 there (and only there).  Hopefully we can fix 
that quickly!
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
