Re: [Json] Response to Statement from W3C TAG

Allen Wirfs-Brock Tue, 10 Dec 2013 18:06:58 -0800

On Dec 9, 2013, at 10:52 PM, James Clark wrote:

> On Fri, Dec 6, 2013 at 2:51 AM, Allen Wirfs-Brock <[email protected]> 
> wrote:
> 
> The static semantics of a language are a set of rules that further restrict  
> which sequences of symbols form valid statements within the language.  For 
> example, a rule that the 'member' names must be disjoint within an 'object' 
> production could be a static semantic rule (however, there is intentionally 
> no such rule in ECMA-404).
> 
> The line between syntax and static semantics can be fuzzy.  Static semantic 
> rules are typically used to express rules that cannot be technically 
> expressed using the chosen syntactic formalism or rules which are simply 
> inconvenient to express using that formalism.  For example, the editor of 
> ECMA-404 chose to simplify the RR track expression of the JSON syntax by 
> using static semantic rules for whitespace rather than incorporating them 
> into RR diagrams. 
> 
> Another form of static semantic rules are equivalences that state when two or 
> more different sequences of symbols must be considered as equivalent.  For 
> example, the rules that state equivalencies between escape sequences and 
> individual code points within a JSON 'string'.  Such equivalences are not 
> strictly necessary at this level, but it simplifies the specification of 
> higher level semantics if equivalent symbol sequences can be normalized at 
> this level of specification.
>   
> When we talk about the "semantics" of a language (rather than "static 
> semantics") we are talking about attributing meaning (in some domain and 
> context) to well-formed (as specified via syntax and static semantics) 
> statements expressed in that language. 
> ... 
> What we can do, is draw a bright-line just above the level of static 
> semantics. This is what ECMA-404 attempts to do. 
> 
> I don't see how you can accommodate the second kind of static semantic rule 
> within the definition of conformance that you have chosen for ECMA-404. 
> Section 2 defines conformance in terms of whether a sequence of Unicode code 
> points conforms to the grammar.  This doesn't even accommodate the first kind 
> of static semantic rule, but it is obviously easy to extend it so that it 
> does.  However, to accommodate the second kind of static semantic rule, you 
> would need a notion of conformance that deals with how conforming parsers 
> interpret a valid sequence of code points.


Well, it certainly is a nit to pick, but in context I interpret the term 
"grammar" as used in clause 2 (and also the Introduction) as meaning the full 
normative content of clauses 4 to 9. This includes the actual CFG specification 
and the associated static semantic rules. 

The notion of a conforming parser could be added, but I am less sure that it is 
really necessary.  We don't even need to consider string escapes to get into 
the issue of equivalent JSON texts, as it also exists because of optional white 
space.
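
To make the white space point concrete, here is a minimal sketch. It assumes 
Python's standard json module as a stand-in for some higher-level binding; 
ECMA-404 itself defines no parser, and the two sample texts are made up for 
illustration.

```python
import json

# Two JSON texts that differ only in optional white space.
compact = '{"a":[1,2],"b":null}'
spaced = '{ "a" : [ 1 , 2 ] ,\n  "b" : null }'

# As sequences of code points they are different texts...
texts_differ = compact != spaced

# ...yet this particular parser maps both to the same value.
values_equal = json.loads(compact) == json.loads(spaced)
```

Nothing here involves string escapes; the equivalence question already arises 
from white space alone.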

> 
> I think it is coherent to draw a bright-line just above the first level of 
> static semantics.  If you did that, then most of the prose of section 9 (on 
> Strings) would have to be removed; but this would be rather inconvenient, 
> because most specifications of higher-level semantics would end up having to 
> specify it themselves.

I generally agree with this, including the convenience perspective.  It 
essentially also applies to the decimal interpretation of numbers.  There is an 
argument to be made that both should just be discussed informatively, leaving 
it to higher-level semantic specs to make those interpretations normative.
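
As a rough illustration of how the decimal interpretation depends on the 
higher-level binding, the sketch below compares two hypothetical bindings in 
Python: one that reads a number token as an exact decimal, and one that reads 
it as an IEEE double.  The helper names are invented for this example, and 
both helpers assume the token has already been checked against the JSON number 
grammar (Decimal and float each accept inputs that are not valid JSON numbers).

```python
from decimal import Decimal

def exact_value(token: str) -> Decimal:
    """Hypothetical binding: read a JSON number token as an exact decimal."""
    return Decimal(token)

def double_value(token: str) -> float:
    """Hypothetical binding: read a JSON number token as an IEEE double."""
    return float(token)

# Tokens that differ lexically but denote the same mathematical value.
same = exact_value("1e1") == exact_value("1E1") == exact_value("1e+1")

# Tokens that the exact binding keeps apart but the double binding merges.
a, b = "1", "1.0000000000000000001"
exact_distinct = exact_value(a) != exact_value(b)
double_merged = double_value(a) == double_value(b)
```

The two bindings disagree about which tokens are equivalent, which is the kind 
of decision a higher-level spec would have to make normative.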

> 
> However, I find it hard to see any bright-line above the second level of 
> static semantics and below semantics generally.  Let's consider section 9. I 
> would argue that this section should define a "semantics" for string tokens, 
> by defining a mapping from sequences of code points matching the production 
> _string_ (what I would call the "lexical space") into arbitrary sequences of 
> code points (what I would call the "value space"). The spec sometimes seems 
> to be doing this and sometimes seems to be doing something more like your 
> second kind of static semantics. Sometimes it uses the term "code point" or 
> "character" to refer to code points in the lexical space ("A string is a 
> sequence of Unicode code points wrapped with quotation marks"), and sometimes 
> it uses those terms to refer to code points in the value space ("Any code 
> point may be represented as a hexadecimal number").   You could redraft so 
> that it was expressed purely in terms of code points in the lexical space, 
> but that would be awkward and unnatural: for example, a hexadecimal escape 
> would represent either one or two code points in the lexical space.  
> Furthermore I don't see what you would gain by this.  Once you talk about 
> equivalences between sequences, you are into semantics and you need a richer 
> notion of conformance.

Generally agree. We are probably seeing some editorial confusion as feedback 
(including mine) was integrated into the editor's initial draft. This can all 
be improved in a subsequent edition.
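
The lexical space / value space distinction can be sketched as follows.  This 
is only an illustration that borrows the string handling of Python's standard 
json module; the helper name string_value is hypothetical and nothing like it 
appears in ECMA-404.

```python
import json

def string_value(token: str) -> str:
    """Map a JSON string token (lexical space) to its code points (value space)."""
    value = json.loads(token)
    if not isinstance(value, str):
        raise ValueError("not a JSON string token")
    return value

# An escape and a literal character are different tokens with the same value.
escaped = '"\\u0041"'   # quote, backslash, u, 0, 0, 4, 1, quote
literal = '"A"'
same_value = string_value(escaped) == string_value(literal)

# Two hexadecimal escapes forming a surrogate pair: 14 code points in the
# lexical space, a single code point (U+1F600) in the value space.
pair = '"\\ud83d\\ude00"'
lexical_length = len(pair)
value_length = len(string_value(pair))
```

Note that combining the surrogate pair into one code point is this library's 
choice; a different binding could reasonably expose two code units instead.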

> 
> So back to "semantics" and why ECMA-404 tries (perhaps imperfectly) to avoid 
> describing JSON beyond the level of static semantics. 
> 
> ECMA-404 sees JSON as "a text format that facilitates structured data 
> interchange between all programming languages. JSON
> is syntax of braces, brackets, colons, and commas that is useful in many 
> contexts, profiles, and applications".
> 
> There are many possible semantics and categories of semantics that can be 
> applied to well-formed statements expressed using the JSON syntax.
> ...
> 
> The problem with trying to standardize JSON semantics is that the various 
> semantics that can usefully be imposed upon JSON are often mutually 
> incompatible with each other. At a trivial level, we see this with issues 
> like the size of numbers or duplicate object member keys.  It is very hard to 
> decide whose semantics are acceptable and whose are not.
> 
> I would argue that ECMA-404 should define the least restrictive reasonable 
> semantics: the semantics should not treat as identical any values that higher 
> layers might reasonably want to treat as distinct.  This is not the one, true 
> JSON semantics: it is merely a semantic layer on which other higher-level 
> semantic layers can in turn be built.  I don't think it's so hard to define 
> this:
> 
> 1. a value is an object, array, number, string, boolean or null.
> 2. an object is an ordered sequence of <string, value> pairs
> 3. an array is an ordered sequence of values
> 4. a string is an ordered sequence of Unicode code points

Indeed, this aligns very well with my perspective.
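
For illustration, the four-item model above can be approximated with Python's 
standard json module.  This is a sketch under assumptions, not something 
ECMA-404 specifies: the names Obj and parse_least_restrictive are invented 
here, and using Decimal as the value space for numbers is just one possible 
choice.

```python
import json
from decimal import Decimal
from typing import NamedTuple

class Obj(NamedTuple):
    """An object as an ordered sequence of (string, value) pairs."""
    members: tuple

def _reject_constant(name: str):
    # NaN and Infinity are accepted by json.loads but are not in the JSON grammar.
    raise ValueError(f"not a JSON value: {name}")

def parse_least_restrictive(text: str):
    """Parse JSON text, keeping member order and duplicate member names."""
    return json.loads(
        text,
        object_pairs_hook=lambda pairs: Obj(tuple(pairs)),
        parse_int=Decimal,
        parse_float=Decimal,
        parse_constant=_reject_constant,
    )

dup = parse_least_restrictive('{"b":1,"a":2,"b":3}')
order_matters = (parse_least_restrictive('{"a":1,"b":2}')
                 != parse_least_restrictive('{"b":2,"a":1}'))
```

Arrays come back as Python lists and strings as Python str, so only objects 
and numbers need special handling.  A dict-based binding can then be layered 
on top by discarding order and duplicates.  One caveat: Python's == treats 
True and 1 as equal, so a stricter comparison would also have to check types.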

> 
> Item 2 may be surprising to some people, but there's not really much choice 
> given that JS preserves the order of object keys.  The difficult case is 
> number. But even with number, I would argue that there are clearly some 
> lexical values that can uncontroversially be specified to be equivalent (for 
> example, 1e1 with 1E1 or 1e1 with 1e+1).  A set of decisions on lexical 
> equivalence effectively determines a value space for numbers.  For example, 
> you might reasonably decide that two values are equivalent if they represent 
> real numbers with the same mathematical value.
> 
> If ECMA-404 doesn't provide such a semantic layer, it becomes quite 
> challenging for higher-level language bindings to specify their semantics in 
> a truly rigorous way.  Take strings for example.  I think by far the cleanest 
> way to rigorously define a mapping from string tokens to sequences of code 
> points is to have a BNF and a syntax-directed mapping as the ECMAScript spec 
> does very nicely in 7.8.4 
> (http://www.ecma-international.org/ecma-262/5.1/#sec-7.8.4).  If ECMA-404 
> provides merely a syntax and a specification of string equivalence, it 
> becomes quite a challenge to draft a specification that somehow expresses the 
> mapping while still normatively relying on the ECMA-404 spec for the syntax. 
> What will happen in practice is that these higher level mappings will not be 
> specified rigorously.
> 
> I think ECMA-404 would be significantly more useful for its intended purpose 
> if it provided the kind of semantics I am suggesting.
> 
> I know XML is not very fashionable these days but we have a couple of decades 
> of experience with XML and SGML which I think do have some relevance to a 
> specification of "structured data interchange between programming languages".  
>  One conclusion I would draw from this experience is that the concept of an 
> XML Infoset or something like it is very useful.  Most users of XML deal with 
> higher-level semantic abstractions rather than directly with the XML Infoset, 
> but it has proven very useful to be able to specify these higher-level 
> semantic abstractions in terms of the XML Infoset rather than having to 
> specify them directly in terms of the XML syntax.  Another conclusion I would 
> draw is that it would have worked much better to integrate the XML Infoset 
> specification into the main XML specification.  The approach of having a 
> separate XML Infoset specification has meant that there is no proper rigorous 
> specification of how to map from the XML syntax to the XML Infoset (it seems to 
> be assumed to be so obvious that it does not need stating).  I tried an 
> integrated approach of specifying the syntax and data model together in the 
> MicroXML spec 
> (https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html), and I 
> think it works much better. The current approach of ECMA-404 is a bit like 
> that of the XML Recommendation: it pretends at times to be just specifying 
> when a sequence of code points is valid, and yet the specification contains a 
> fairly random selection of statements of how a valid sequence should be 
> interpreted. 

Thank you, this is very useful feedback.  Would you mind submitting this as a 
bug report against ECMA-404 at bugs.ecmascript.org? I can do it, but community 
feedback is important and I'd like you to be on the CC list for the bug.

Allen


> 
> James
>

_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss
