I don't see the standard allowing character encodings other than UTF-16 in strings. Section 8.4 says "When a String contains actual textual data, each element is considered to be a single UTF-16 code unit." This aligns with other normative references to UTF-16 in sections 2, 6, and 15.1.3. Section 8.4 does seem to allow the use of strings for non-textual data, but character encodings are by definition for characters, i.e., textual data.
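
As a rough illustration of the Section 8.4 model described above, here is a minimal sketch in plain JavaScript. The helper names `codeUnits` and `toUtf16` are hypothetical and do not come from the standard; only `charCodeAt` and `String.fromCharCode` are built-ins. It shows that each String element is a 16-bit value, that a supplementary character takes two elements, and that a String can also hold values that are not well-formed UTF-16.

```javascript
// Sketch only: codeUnits and toUtf16 are illustrative helper names,
// not functions defined by the ECMAScript standard.

// Return the 16-bit elements of a String as an array of numbers.
function codeUnits(s) {
  var units = [];
  for (var i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// Encode a code point (0..0x10FFFF) as one or two UTF-16 code units.
// Values in the surrogate range 0xD800..0xDFFF are passed through as-is.
function toUtf16(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  if (codePoint < 0x10000) {
    return String.fromCharCode(codePoint);
  }
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // high (lead) surrogate
  var low = 0xDC00 + (offset & 0x3FF);  // low (trail) surrogate
  return String.fromCharCode(high, low);
}

var bmp = toUtf16(0x00E9);                       // one element
var supplementary = toUtf16(0x1D11E);            // two elements
var loneSurrogate = String.fromCharCode(0xD800); // not textual data

// The largest code point, 0x10FFFF, is below 2^21.
var fitsIn21Bits = 0x10FFFF < Math.pow(2, 21);
```

Calling `codeUnits(supplementary)` yields the surrogate pair 0xD834, 0xDD1E, while `loneSurrogate` is a valid String value of length 1 that does not represent any character.
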

Using a Unicode escape for non-textual data seems like abuse to me - Unicode is a character encoding standard. For Unicode, anything beyond six hex digits is excessive.

Norbert

On Jan 24, 2012, at 17:14, Allen Wirfs-Brock wrote:

>
> On Jan 24, 2012, at 2:11 PM, Mark S. Miller wrote:
>
>> On Tue, Jan 24, 2012 at 12:33 PM, Allen Wirfs-Brock <[email protected]>
>> wrote:
>> Note that this proposal isn't currently under consideration for inclusion in
>> ES.next, but the answer to your question is below
>> [...]
>> Just as the current definition of string specifies that a String is a
>> sequence of 16-bit unsigned integer values, the proposal would specify that
>> a String is a sequence of 32-bit unsigned integer values. In neither case
>> is it required that the individual String elements must be valid Unicode
>> code points or code units. 8 hex digits are required to express the full
>> range of unsigned 32-bit integers.
>>
>> Why 32? Unicode has only 21 bits of significance. Since we don't expect
>> strings to be stored naively (taking up 4x the space that would otherwise be
>> allocated),
> I believe most current implementations actually store 16 bits per character,
> so it would be 2x rather than 4x
>>
>
>> I don't see the payoff from choosing the next power of 2. The other choices
>> I see are a) 21 bits, b) 53 bits, or c) unbounded.
>
> The current 16-bit character strings are sometimes used to store non-Unicode
> binary data and can be used with non-Unicode character encodings with up to
> 16-bit chars. 21 bits is sufficient for Unicode but perhaps is not enough
> for other useful encodings. 32-bit seems like a plausible unit.
>
> The real controversy that developed over this proposal regarded whether or
> not every individual Unicode character needs to be uniformly representable
> as a single element of a String. This proposal took the position that they
> should. Other voices felt that such uniformity was unnecessary and seemed
> content to expose UTF-8 or UTF-16. The argument was that applications may
> have to look at multiple-character logical units anyway, so dealing with UTF
> encodings isn't much of an added burden.
>
> Allen
> _______________________________________________
> es-discuss mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/es-discuss

