On Fri, Dec 6, 2013 at 2:51 AM, Allen Wirfs-Brock <[email protected]> wrote:
>
> The static semantics of a language are a set of rules that further
> restrict which sequences of symbols form valid statements within the
> language. For example, a rule that the 'member' names must be disjoint
> within an 'object' production could be a static semantic rule (however,
> there is intentionally no such rule in ECMA-404).
>
> The line between syntax and static semantics can be fuzzy. Static
> semantic rules are typically used to express rules that cannot be
> technically expressed using the chosen syntactic formalism or rules which
> are simply inconvenient to express using that formalism. For example, the
> editor of ECMA-404 chose to simplify the railroad-track (RR) expression of
> the JSON syntax by using static semantic rules for whitespace rather than
> incorporating them into the RR diagrams.
>
> Another form of static semantic rule is an equivalence that states when two
> or more different sequences of symbols must be considered equivalent.
> For example, the rules that state equivalences between escape sequences
> and individual code points within a JSON 'string'. Such equivalences are
> not strictly necessary at this level, but it simplifies the
> specification of higher-level semantics if equivalent symbol sequences can
> be normalized at this level of specification.
>
> When we talk about the "semantics" of a language (rather than "static
> semantics") we are talking about attributing meaning (in some domain and
> context) to well-formed (as specified via syntax and static semantics)
> statements expressed in that language.
>
...
> What we can do is draw a bright-line just above the level of static
> semantics. This is what ECMA-404 attempts to do.
I don't see how you can accommodate the second kind of static semantic rule
within the definition of conformance that you have chosen for ECMA-404.
Section 2 defines conformance in terms of whether a sequence of Unicode
code points conforms to the grammar. This doesn't even accommodate the
first kind of static semantic rule, but it is obviously easy to extend it
so that it does. However, to accommodate the second kind of static
semantic rule, you would need a notion of conformance that deals with how
conforming parsers interpret a valid sequence of code points.
I think it is coherent to draw a bright-line just above the first level of
static semantics. If you did that, then most of the prose of section 9 (on
Strings) would have to be removed; but this would be rather inconvenient,
because most specifications of higher-level semantics would end up having
to specify those rules themselves.
However, I find it hard to see any bright-line above the second level of
static semantics and below semantics generally. Let's consider section 9.
I would argue that this section should define a "semantics" for string
tokens, by defining a mapping from sequences of code points matching the
production _string_ (what I would call the "lexical space") into arbitrary
sequences of code points (what I would call the "value space"). The spec
sometimes seems to be doing this and sometimes seems to be doing something
more like your second kind of static semantics. Sometimes it uses the term
"code point" or "character" to refer to code points in the lexical space
("A string is a sequence of Unicode code points wrapped with quotation
marks"), and sometimes it uses those terms to refer to code points in the
value space ("Any code point may be represented as a hexadecimal number").
You could redraft so that it was expressed purely in terms of code points
in the lexical space, but that would be awkward and unnatural: for example,
a hexadecimal escape would represent either one or two code points in the
lexical space. Furthermore I don't see what you would gain by this. Once
you talk about equivalences between sequences, you are into semantics and
you need a richer notion of conformance.
> So back to "semantics" and why ECMA-404 tries (perhaps imperfectly) to
> avoid describing JSON beyond the level of static semantics.
>
> ECMA-404 sees JSON as "a text format that facilitates structured data
> interchange between all programming languages. JSON
> is syntax of braces, brackets, colons, and commas that is useful in many
> contexts, profiles, and applications".
>
> There are many possible semantics and categories of semantics that can be
> applied to well-formed statements expressed using the JSON syntax.
>
...
>
> The problem with trying to standardize JSON semantics is that the various
> semantics that can usefully be imposed upon JSON are often mutually
> incompatible with each other. At a trivial level, we see this with issues
> like the size of numbers or duplicate object member keys. It is very hard
> to decide whose semantics are acceptable and whose are not.
>
I would argue that ECMA-404 should define the least restrictive reasonable
semantics: the semantics should not treat as identical any values that
higher layers might reasonably want to treat as distinct. This is not the
one, true JSON semantics: it is merely a semantic layer on which other
higher-level semantic layers can in turn be built. I don't think it's so
hard to define this:
1. a value is an object, array, number, string, boolean or null.
2. an object is an ordered sequence of <string, value> pairs
3. an array is an ordered sequence of values
4. a string is an ordered sequence of Unicode code points
Item 2 may be surprising to some people, but there's not really much choice
given that JS preserves the order of object keys. The difficult case is
number. But even with number, I would argue that there are clearly some
lexical values that can uncontroversially be specified to be equivalent
(for example, 1e1 with 1E1 or 1e1 with 1e+1). A set of decisions on
lexical equivalence effectively determines a value space for numbers. For
example, you might reasonably decide that two values are equivalent if they
represent real numbers with the same mathematical value.
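One way to make that equivalence precise (a sketch; the function name and
the canonical form are my invention) is to map each number token to a
canonical exact-decimal representation, so that two tokens are equivalent
if and only if their canonical forms are equal:

```javascript
// Sketch: map a JSON number token to a canonical exact-decimal form
// "<digits>e<exp>", so tokens with the same mathematical value map to
// the same string. No floating-point arithmetic is involved.
function canonicalNumber(token) {
  const m = /^(-?)(0|[1-9]\d*)(?:\.(\d+))?(?:[eE]([+-]?\d+))?$/.exec(token);
  if (!m) throw new SyntaxError("not a JSON number: " + token);
  const sign = m[1] === "-";
  let digits = m[2] + (m[3] || "");                       // significant digits
  let exp = (m[4] ? parseInt(m[4], 10) : 0) - (m[3] ? m[3].length : 0);
  while (digits.length > 1 && digits.endsWith("0")) {     // strip trailing zeros,
    digits = digits.slice(0, -1);                         // adjusting the exponent
    exp += 1;
  }
  digits = digits.replace(/^0+(?=\d)/, "");               // strip leading zeros
  if (digits === "0") return "0";                         // 0, -0, 0e5 all collapse
  return (sign ? "-" : "") + digits + "e" + exp;
}

console.log(canonicalNumber("1e1"));    // "1e1"
console.log(canonicalNumber("10.0"));   // "1e1" -- same mathematical value
```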
If ECMA-404 doesn't provide such a semantic layer, it becomes quite
challenging for higher-level language bindings to specify their semantics
in a truly rigorous way. Take strings, for example. I think by far the
cleanest way to rigorously define a mapping from string tokens to sequences
of code points is to have a BNF and a syntax-directed mapping as the
ECMAScript spec does very nicely in 7.8.4 (
http://www.ecma-international.org/ecma-262/5.1/#sec-7.8.4). If ECMA-404
provides merely a syntax and a specification of string equivalence, it
becomes quite a challenge to draft a specification that somehow expresses
the mapping while still normatively relying on the ECMA-404 spec for the
syntax. What will happen in practice is that these higher-level mappings
will not be specified rigorously.
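For comparison, the kind of syntax-directed mapping I have in mind is only
a few lines when written out as code. This is a sketch under stated
assumptions (a well-formed string token, quotes included; the function
name is mine), not proposed spec text:

```javascript
// Sketch of a syntax-directed mapping from a well-formed JSON string
// token (lexical space) to an array of Unicode code points (value space).
function stringValue(token) {
  const body = token.slice(1, -1);          // drop the surrounding quotes
  const simple = { '"': 0x22, "\\": 0x5c, "/": 0x2f,
                   b: 0x08, f: 0x0c, n: 0x0a, r: 0x0d, t: 0x09 };
  const out = [];
  for (let i = 0; i < body.length; ) {
    if (body[i] !== "\\") {                 // unescaped code point
      const cp = body.codePointAt(i);
      out.push(cp);
      i += cp > 0xffff ? 2 : 1;
      continue;
    }
    const e = body[i + 1];
    if (e in simple) { out.push(simple[e]); i += 2; continue; }
    // \uXXXX escape; a surrogate pair of escapes yields one code point
    const hi = parseInt(body.slice(i + 2, i + 6), 16);
    i += 6;
    if (hi >= 0xd800 && hi <= 0xdbff && body.slice(i, i + 2) === "\\u") {
      const lo = parseInt(body.slice(i + 2, i + 6), 16);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        out.push(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
        i += 6;
        continue;
      }
    }
    out.push(hi);
  }
  return out;
}

console.log(stringValue('"\\u0041"'));       // [ 65 ]
console.log(stringValue('"\\uD834\\uDD1E"')); // [ 119070 ] i.e. U+1D11E
```

A specification could state the same mapping as productions plus value
rules, as ECMA-262 7.8.4 does; the point is only that the mapping itself,
not merely an equivalence relation, is what higher layers need.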
I think ECMA-404 would be significantly more useful for its intended
purpose if it provided the kind of semantics I am suggesting.
I know XML is not very fashionable these days but we have a couple of
decades of experience with XML and SGML which I think do have some
relevance to a specification of "structured data interchange between
programming languages". One conclusion I would draw from this experience
is that the concept of an XML Infoset or something like it is very useful.
Most users of XML deal with higher-level semantic abstractions rather than
directly with the XML Infoset, but it has proven very useful to be able to
specify these higher-level semantic abstractions in terms of the XML
Infoset rather than having to specify them directly in terms of the XML
syntax. Another conclusion I would draw is that it would have worked much
better to integrate the XML Infoset specification into the main XML
specification. The approach of having a separate XML Infoset specification
has meant that there is no proper rigorous specification of how to map from
the XML syntax to the XML Infoset (it seems to be assumed to be so obvious
that it does not need stating). I tried an integrated approach of
specifying the syntax and data model together in the MicroXML spec (
https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html), and I
think it works much better. The current approach of ECMA-404 is a bit like
that of the XML Recommendation: it pretends at times to be just specifying
when a sequence of code points is valid, and yet the specification contains
a fairly random selection of statements about how a valid sequence should be
interpreted.
James
_______________________________________________
es-discuss mailing list
[email protected]
https://mail.mozilla.org/listinfo/es-discuss