On Sun, Mar 18, 2018 at 10:29 AM, Mike Samuel <mikesam...@gmail.com> wrote:

> Does this mean that the language below would need to be fixed at a
> specific version of Unicode or that we would need to cite a specific
> version for canonicalization but might allow a higher version for
> String.prototype.normalize and in future versions of the spec require it?
>
> http://www.ecma-international.org/ecma-262/6.0/#sec-conformance
> """
> A conforming implementation of ECMAScript must interpret source text input
> in conformance with the Unicode Standard, Version 5.1.0 or later
> """
>
> and in ECMA 404
> <http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf>
>
> """
> For undated references, the latest edition of the referenced document
> (including any amendments) applies. ISO/IEC 10646, Information Technology –
> Universal Coded Character Set (UCS) The Unicode Consortium. The Unicode
> Standard http://www.unicode.org/versions/latest.
> """
>

I can't see why either would have to change. JSON canonicalization should
produce a JSON text in UTF-8, using JSON escape sequences only for double
quote, backslash, and ASCII control characters U+0000 through U+001F (which
are not valid in JSON strings) and unpaired surrogates U+D800 through
U+DFFF (which are not conforming UTF-8). The algorithm doesn't need to know
whether any given code point has a UCS assignment.
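
A minimal sketch of that string rule, for illustration only. The names
canonicalString and escapeUnit are hypothetical, and two choices here are
assumptions rather than anything settled in this thread: control characters
always use the \uXXXX form (never \n or \t), and hex digits are uppercase.

```javascript
// Hypothetical helper: format one UTF-16 code unit as a \uXXXX escape.
// Uppercase hex is an assumption; lowercase would work equally well.
function escapeUnit(unit) {
  return '\\u' + unit.toString(16).toUpperCase().padStart(4, '0');
}

// Hypothetical helper: serialize a string as a JSON string literal, escaping
// only double quote, backslash, U+0000..U+001F, and unpaired surrogates.
function canonicalString(s) {
  let out = '"';
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit === 0x22) {
      out += '\\"';
    } else if (unit === 0x5c) {
      out += '\\\\';
    } else if (unit <= 0x1f) {
      out += escapeUnit(unit);
    } else if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: emit literally only when a low surrogate follows.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += s[i] + s[i + 1];
        i++;
      } else {
        out += escapeUnit(unit);
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      out += escapeUnit(unit);
    } else {
      out += s[i];
    }
  }
  return out + '"';
}
```

Nothing in the function consults Unicode character data; it only looks at
code unit ranges, which is why no Unicode version needs to be pinned.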

> Code points include orphaned surrogates in a way that scalar values do not,
> right?  So both "\uD800" and "\uD800\uDC00" are single codepoints.
> It seems like a strict prefix of a string should still sort before that
> string but prefix transitivity in general does not hold: "\uFFFF" <
> "\uD800\uDC00" && "\uFFFF" > "\uD800".
> That shouldn't cause problems for hashability but I thought I'd raise it
> just in case.
>

IMO, "\uD800\uDC00" should never be emitted because a proper
canonicalization would be "𐀀" (character sequence U+0022 QUOTATION MARK,
U+10000 LINEAR B SYLLABLE B008 A, U+0022 QUOTATION MARK; octet sequence
0x22, 0xF0, 0x90, 0x80, 0x80, 0x22).

As for sorting, using the represented code points makes sense to me, but is
not the only option (e.g., another option is using the literal characters
of the JSON text such that "Z" < "\"" < "\\" < "\u0000" < "\u001F" <
"\uD800" < "\uDC00" < "^" < "x" < "ä" < "가" < "Ａ" < "🔥" < "🙃"). Any
specification of a total deterministic ordering would suffice, it's just
that some are less intuitive than others.
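
To make the difference between two of the candidate orderings concrete,
here is a rough sketch comparing strings by UTF-16 code units and by
represented code points. Both function names are hypothetical, and treating
an unpaired surrogate as a code point of its own follows the quoted example
above rather than any agreed specification.

```javascript
// Compare by UTF-16 code units, which is what the built-in < operator
// does for ECMAScript strings.
function compareCodeUnits(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Compare by code points. String iteration yields one element per code
// point, and an unpaired surrogate is yielded as its own element.
function compareCodePoints(a, b) {
  const x = Array.from(a, (ch) => ch.codePointAt(0));
  const y = Array.from(b, (ch) => ch.codePointAt(0));
  const n = Math.min(x.length, y.length);
  for (let i = 0; i < n; i++) {
    if (x[i] !== y[i]) return x[i] < y[i] ? -1 : 1;
  }
  // Equal up to the shorter length: the shorter string (a prefix) sorts first.
  return Math.sign(x.length - y.length);
}

console.log(compareCodePoints('\uFFFF', '\uD800\uDC00')); // -1
console.log(compareCodePoints('\uFFFF', '\uD800'));       // 1
console.log(compareCodeUnits('\uFFFF', '\uD800\uDC00'));  // 1
```

Both comparators are total and deterministic; they simply disagree about
where supplementary characters fall relative to U+E000 through U+FFFF.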

On Sun, Mar 18, 2018 at 10:30 AM, Anders Rundgren <
anders.rundgren....@gmail.com> wrote:

> On 2018-03-18 15:08, Richard Gibson wrote:
>
> In that they have the same goal, yes. In that they both achieve that goal,
> no. I'm not married to choices like exponential notation and uppercase
> escapes, but a JSON canonicalization scheme MUST cover all of JSON.
>
>
> Here it gets interesting...  What in JSON cannot be expressed through JS
> and JSON.stringify()?
>

JSON can express arbitrary numbers, but ECMAScript JSON.stringify is
limited to those with an exact IEEE 754 binary64 representation.
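
A quick way to see the gap is to round-trip number texts through
JSON.parse and JSON.stringify. The helper name survivesRoundTrip is
hypothetical and the inputs are arbitrary illustrations.

```javascript
// Hypothetical helper: does a JSON number text come back unchanged after
// being parsed to a binary64 value and serialized again?
function survivesRoundTrip(text) {
  return JSON.stringify(JSON.parse(text)) === text;
}

console.log(survivesRoundTrip('0.5'));                    // true
console.log(survivesRoundTrip('0.10000000000000000001')); // false, becomes 0.1
console.log(survivesRoundTrip('12345678901234567890'));   // false, low digits are lost
console.log(JSON.stringify(JSON.parse('1e400')));         // null, the value overflows to Infinity
```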

And probably more importantly (though not a gap with respect to JSON
specifically), it emits octet sequences that don't conform to UTF-8 when
serializing unpaired surrogates.
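
The underlying problem can be sketched without JSON.stringify at all,
whose output for a lone surrogate depends on the engine version (engines
that implement the ES2019 well-formed JSON.stringify change emit a \uXXXX
escape instead). A standards-conforming UTF-8 encoder has no byte sequence
for an unpaired surrogate, so TextEncoder substitutes U+FFFD. This assumes
an environment that provides the TextEncoder global, such as a current
browser or Node.js.

```javascript
// Encode a lone high surrogate and a proper surrogate pair as UTF-8.
const encoder = new TextEncoder();
const loneBytes = Array.from(encoder.encode('\uD800'));
const pairBytes = Array.from(encoder.encode('\uD800\uDC00'));

// The lone surrogate is replaced by U+FFFD REPLACEMENT CHARACTER.
console.log(loneBytes.map((b) => b.toString(16))); // [ 'ef', 'bf', 'bd' ]
// The pair encodes U+10000 as four octets.
console.log(pairBytes.map((b) => b.toString(16))); // [ 'f0', '90', '80', '80' ]
```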

> Certain scenarios call for different systems to _independently_ generate
> equivalent data structures, and it is a necessary property of canonical
> serialization that it yields identical results for equivalent data
> structures. JSON does not specify significance of object member ordering,
> so member ordering does not distinguish otherwise equivalent objects, so
> canonicalization MUST specify member ordering that is deterministic with
> respect to all valid data.
>
>
> Violently agree but do not understand (I guess I'm just dumb...) why (for
> example) sorting on UCS2/UTF-16 Code Units would not achieve the same goal
> (although the result would differ).
>

Any specification of a total deterministic ordering would suffice. Relying
upon 16-bit code units would impose a greater burden on systems that do not
use such representations internally, but is not fundamentally broken.
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
