On 21.08.2015 at 19:30, Andrei Alexandrescu wrote:
On 8/18/15 12:54 PM, Sönke Ludwig wrote:
On 18.08.2015 at 00:21, Andrei Alexandrescu wrote:
* On the face of it, dedicating 6 modules to such a small specification
as JSON seems excessive. I'm thinking one module here. (As a simple
point: who would ever want to import only foundation, which in turn has
one exception type and one location type in it?) I think it shouldn't be
up for debate that we must aim for simple and clean APIs.

That would mean a single module that is >5k lines long. Spreading out
certain things, such as JSONValue, into their own modules also makes
sense to avoid unnecessarily large imports where other parts of the
functionality aren't needed. Maybe we could move some private things to
"std.internal" or similar and merge some of the modules?

That would help. My point is it's good design to make the response
proportional to the problem. 5K lines is not a lot, but reducing those
5K in the first place would be a noble pursuit. And btw saving parsing
time is so C++ :o).

Most lines are needed for tests and documentation. Surely dropping some functionality would make the module smaller, too. But there is not a lot to take away without making severe compromises in terms of actual functionality or usability.

But I also think that grouping symbols by topic is a good thing and
makes figuring out the API easier. There is also always package.d if you
really want to import everything.

Figuring out the API easily is a good goal. The best way to achieve that
is making the API no larger than necessary.

So, what's your suggestion? Remove all read*/skip* functions, for example? Make them member functions of JSONParserRange instead of UFCS functions? We could of course also just use the pseudo-modules that std.algorithm had, where we'd create a table in the documentation for each category of functions.

Another thing I'd like to add is an output range that takes parser nodes
and writes to a string output range. This would be the kind of interface
that would be most useful for a serialization framework.

Couldn't that be achieved trivially by e.g. using map!(t => t.toString)
or similar?

This is the nice thing about rangifying everything - suddenly you have a
host of tools at your disposal.

No, the idea is to have an output range like so:

    Appender!string dst;
    auto r = JSONNodeOutputRange(&dst);
    r.put(beginArray);
    r.put(1);
    r.put(2);
    r.put(endArray);

This would provide a forward interface for code that has to directly iterate over its input, which is the case for a serializer - it can't provide an input range interface in a sane way. The alternative would be either to let the serializer re-implement all of JSON, or to provide just some primitives (a writeJSON() that takes bool, number or string) and let the serializer implement the rest of JSON (arrays/objects), which includes certain options, such as pretty-printing.
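
For illustration, a serializer could then drive that single interface directly; a minimal hypothetical sketch (JSONNodeOutputRange and the begin/end markers are assumed from the snippet above, not a finalized API):

    // Hypothetical sketch: a serializer feeding nodes into the output range.
    void serializeArray(R, T)(ref R dst, T[] values)
    {
        dst.put(beginArray);
        foreach (v; values)
            dst.put(v);      // each element is emitted as a JSON node
        dst.put(endArray);
    }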

- Also, at token level strings should be stored with escapes unresolved.
If the user wants a string with the escapes resolved, a lazy range
does it.

To make things efficient, it currently stores escaped strings if slices
of the input are used, but stores unescaped strings if allocations are
necessary anyway.

That seems a good balance, and probably could be applied to numbers as
well.

The difference is that numbers stored as numbers never need to allocate, so for non-sliceable inputs the compromise is not the same.

What about just offering basically three (CT selectable) modes:
- Always parse as double (parse lazily if slicing can be used) (default)
- Parse double or long (again, lazily if slicing can be used)
- Always store the string representation

The question that remains is how to handle this in JSONValue - support just double there? Or something like JSONNumber that abstracts away the differences, but makes writing generic code against JSONValue difficult? Or make it also parameterized in what it can store?
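
Just as a sketch, such a compile-time switch could look roughly like this (the names are purely illustrative, not part of the reviewed API):

    // Hypothetical compile-time switch for number handling.
    enum NumberMode
    {
        asDouble,      // always parse as double (lazily if slicing can be used)
        doubleOrLong,  // parse as double or long (again, lazily if slicing can be used)
        rawString      // always store the string representation
    }

    // The lexer could then be parameterized on it, e.g.:
    //   auto tokens = lexJSON!(NumberMode.doubleOrLong)(input);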

- Validating UTF is tricky; I've seen some discussion in this thread
about it. On the face of it JSON only accepts valid UTF characters. As
such, a modularity-based argument is to pipe UTF validation before
tokenization. (We need a lazy UTF validator and sanitizer stat!) An
efficiency-based argument is to do validation during tokenization. I'm
inclining in favor of modularization, which allows us to focus on one
thing at a time and do it well, instead of duplicating validation
everywhere. Note that it's easy to write routines that do JSON
tokenization and leave UTF validation for later, so there's a lot of
flexibility in composing validation with JSONization.

It's unfortunate to see this change of mind in the face of the work that
already went into the implementation. I also still think that this is a
good optimization opportunity that doesn't really affect the
implementation complexity. Validation isn't duplicated, but reused from
std.utf.

Well if the validation is reused from std.utf, it can't have been very
much work. I maintain that separating concerns seems like a good
strategy here.

There is more to it than the actual call to validate(): writing tests, making sure the surroundings work, adjusting the interface, and writing documentation. It's not *that* much work, but it's nonetheless wasted work.

I also still think that this hasn't been a bad idea at all, because it speeds up the most important use case: parsing JSON from a non-memory source that has not yet been validated. I also very much like the idea of making it a programming error to have invalid UTF stored in a string, i.e. forcing the validation to happen before the cast from bytes to chars.
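
To illustrate the "validate before the cast" idea with existing Phobos primitives (just a sketch of the intended usage pattern, not code from the proposal):

    import std.utf : validate;

    /// Sketch: raw bytes are validated before they are treated as a string.
    string toValidatedString(immutable(ubyte)[] raw)
    {
        // std.utf.validate throws a UTFException on invalid sequences.
        validate(cast(const(char)[]) raw);

        // Only after successful validation is the cast to string legitimate.
        return cast(string) raw;
    }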

- Litmus test: if the input type is a forward range AND if the string
type chosen for tokens is the same as the input type, successful
tokenization should allocate exactly zero memory. I think this is a
simple way to make sure that the tokenization API works well.

Supporting arbitrary forward ranges doesn't seem to be enough; it would
at least have to be combined with something like take(), but then the
type doesn't equal the string type anymore. I'd suggest keeping it to
"if the input is sliceable and the input type equals the string type",
at least for the initial version.

I had "take" in mind. Don't forget that "take" automatically uses slices
wherever applicable. So if you just use typeof(take(...)), you get the
best of all worlds.
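
For example, on a sliceable byte input take() already degrades to a plain slice (a quick illustration, not code from the proposal):

    import std.range : take;

    immutable(ubyte)[] data = [0x7B, 0x7D]; // "{}" as raw bytes

    // For a finite, sliceable input, take() returns a slice of the same type,
    // so typeof(data.take(1)) is still immutable(ubyte)[].
    static assert(is(typeof(data.take(1)) == immutable(ubyte)[]));

    // Narrow strings are different: auto-decoding makes them non-sliceable as
    // ranges, so take() wraps them instead of slicing.
    static assert(!is(typeof("{}".take(1)) == string));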

The more restrictive version seems reasonable for the first release.

Okay.

- The JSON value does its own internal allocation (for e.g. arrays and
hashtables), which should be fine as long as it's encapsulated and we
can tweak it later (e.g. make it use reference counting inside).

Since it's based on (Tagged)Algebraic, the internal types are part of
the interface. Changing them later is bound to break some code. So
AFAICS this would either require making the used types parameterized
(string, array and AA types), or abstracting them away completely, i.e.
only forwarding operations while denying direct access to the types.

... thinking about it, TaggedAlgebraic could do that, while Algebraic
can't.

Well if you figure the general Algebraic type is better replaced by a
type specialized for JSON, fine.

What we shouldn't endorse is two nearly identical library types
(Algebraic and TaggedAlgebraic) that are only different in subtle
matters related to performance in certain use patterns.

If integral tags are better for closed type universes, specialize
Algebraic to use integral tags where applicable.

TaggedAlgebraic would not be a type specialized for JSON! It's useful for all kinds of applications and just happens to have some advantages here, too.

An (imperfect) idea for merging this with the existing Algebraic name:

template Algebraic(T)
    if (is(T == struct) || is(T == union))
{
    // ... implementation of TaggedAlgebraic ...
}

To avoid the ambiguity with an Algebraic over a single type, a UDA could be required for T to get the actual TaggedAlgebraic behavior.

Everything else would be problematic: TaggedAlgebraic needs to be supplied with names for the different types, so the Algebraic(T...) way of specifying allowed types doesn't really work. More importantly, exploiting static type information in the generated interface means breaking code that is currently built around a Variant return value.
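
For reference, the way TaggedAlgebraic gets those names is by being parameterized with a struct or union whose field names become the tags; roughly like this (a sketch - the actual taggedalgebraic package may differ in details):

    // Sketch: the field names of the union become the tag names.
    union Payload
    {
        typeof(null) null_;
        bool         boolean;
        double       number;
        string       text;
    }

    // alias Value = TaggedAlgebraic!Payload;
    // Value v = 3.14;
    // assert(v.kind == Value.Kind.number);  // integral tag derived from the field name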

- Why both parseJSONStream and parseJSONValue? I'm thinking
parseJSONValue would be enough because then you trivially parse a stream
with repeated calls to parseJSONValue.

parseJSONStream is the pull parser (StAX style) interface. It returns
the contents of a JSON document as individual nodes instead of storing
them in a DOM. This part is vital for high-performance parsing,
especially of large documents.
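
A rough usage sketch of the two entry points (only the function names are taken from the proposal; the node handling below is an assumption about the StAX-style interface):

    // Sketch only - parseJSONValue/parseJSONStream are the proposed entry
    // points; the node inspection is an assumption about the interface.
    void example()
    {
        // DOM style: parse the whole document into a JSONValue.
        auto value = parseJSONValue(`{"a": [1, 2, 3]}`);

        // StAX style: consume the document node by node, without building a DOM.
        foreach (node; parseJSONStream(`{"a": [1, 2, 3]}`))
        {
            // each node is one event: object/array start/end, a key, or a literal
        }
    }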

So perhaps this is just a naming issue. The names don't suggest
everything you said. What I see is "parse a JSON stream" and "parse a
JSON value". So I naturally assumed we're looking at consuming a full
stream vs. consuming only one value off a stream and stopping. How about
better names?

parseToJSONValue/parseToJSONStream? parseAsX?

- readArray suddenly introduces a distinct kind of interaction -
callbacks. Why? Should be a lazy range lazy range lazy range. An adapter
using callbacks is then a two-liner.

It just has a more complicated implementation, but is already on the
TODO list.

Great. Let me say again that with ranges you get to instantly tap into a
wealth of tools. I say get rid of the callbacks and let a "tee" take
care of it for whoever needs it.

The callbacks would surely be dropped once ranges become available; foreach() should usually be all that is needed.

- Why is readBool even needed? Just readJSONValue and then enforce it as
a bool. Same reasoning applies to readDouble and readString.

This is for lower-level access. Using parseJSONValue would certainly be
possible, but it would have quite some unneeded overhead and would also
be non-@nogc.

Meh, fine. But all of this is adding weight to the API in the wrong places.

Frankly, I don't think this is the wrong place at all. The pull parser interface is the single most important part of the API when we talk about allocation-free, high-performance operation. It also adds very little weight: it's just a small function that joins the other read* functions quite naturally and doesn't create any additional cognitive load.
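
To illustrate the difference (a sketch; the read* names come from the discussion above, their exact signatures are assumptions):

    // Pull-parser primitive: stays allocation-free and @nogc-compatible.
    double readTemperature(R)(ref R nodes) @nogc
    {
        return nodes.readDouble();
    }

    // DOM detour: builds a full JSONValue just to extract one number -
    // allocates and cannot be @nogc.
    double readTemperatureDOM(string json)
    {
        return parseJSONValue(json).get!double;
    }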


- readObject is with callbacks again - it would be nice if it were a
lazy range.

Okay, is also already on the list.

Awes!

It could return a Tuple!(string, JSONNodeRange). But probably there should also be an opApply for the object field range, so that foreach (key, value; ...) becomes possible.
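
A minimal sketch of the opApply pattern (with placeholder state instead of the actual parser internals):

    // Sketch: enables foreach (key, value; fields) over object fields.
    struct ObjectFieldRange(ValueRange)
    {
        string[] keys;          // placeholder state for this sketch
        ValueRange[] values;

        int opApply(scope int delegate(string key, ref ValueRange value) dg)
        {
            foreach (i, k; keys)
                if (auto r = dg(k, values[i]))
                    return r;
            return 0;
        }
    }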

But apart from that, Algebraic is unfortunately currently quite unsuited
for this kind of abstraction, even if that can be solved in theory (with
a lot of work). It requires writing things like
obj.get!(JSONValue[string])["foo"].get!JSONValue instead of just
obj["foo"], because it simply returns Variant from all of its forwarded
operators.

Algebraic does not expose opIndex. We could add it to Algebraic such
that obj["foo"] returns the same type as "this".

https://github.com/D-Programming-Language/phobos/blob/6df5d551fd8a21feef061483c226e7d9b26d6cd4/std/variant.d#L1088
https://github.com/D-Programming-Language/phobos/blob/6df5d551fd8a21feef061483c226e7d9b26d6cd4/std/variant.d#L1348

It's easy for anyone to say that what's there is unfit for a particular
purpose. It's also easy for many to define an ever-so-slightly-different
new artifact that fits a particular purpose. Where you come in as a
talented hacker is to operate with an understanding of the importance
of making things work, and to make it work.

The problem is that making Algebraic exploit static type information means nothing short of a complete reimplementation, which TaggedAlgebraic is. It also means breaking existing code, if, for example, alg[0] suddenly returns a string instead of just a Variant with a string stored inside.

- JSONValue should be more opaque and not expose its representation as
much as it does now. In particular, offering a built-in hashtable is
bound to be problematic because those are expensive to construct, create
garbage, and are not customizable. Instead, the necessary lookup and set
APIs should be provided by JSONValue whilst keeping the implementation
hidden. The same goes for the array representation - it shall not be
exposed; instead, indexed access primitives should be offered. Separate
types might be offered (e.g. JSONArray, JSONDictionary) if deemed
necessary. The string type should be a type parameter of JSONValue.
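
(For illustration, such an opaque surface might look roughly like this - all names and signatures are hypothetical:)

    // Hypothetical opaque JSONValue: lookup/set primitives are exposed,
    // the container types behind them are not.
    struct JSONValue(String = string)
    {
        // representation hidden; could later switch to custom containers or RC

        // object access
        ref JSONValue opIndex(String key);
        void opIndexAssign(JSONValue value, String key);
        inout(JSONValue)* opBinaryRight(string op : "in")(String key) inout;

        // array access
        ref JSONValue opIndex(size_t index);
        @property size_t length() const;
    }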

This would unfortunately at the same time destroy almost all of the
benefits that using (Tagged)Algebraic has, namely that it opens up the
possibility of interoperability between different data formats (for
example, passing a JSONValue to a BSON generator without letting the
BSON generator know about JSON). This is admittedly an area that I've
not yet properly explored, but I think it's important as we go forward
with other data formats.

I think we need to do it. Otherwise we're stuck with "D's JSON API
cannot be used without the GC". We want to escape that gravitational
pull. I know it's hard. But it's worth it.

I can't fight the feeling that what Phobos currently has in terms of allocators, containers and reference counting is simply not mature enough to make a good decision here. Restricting JSONValue as much as possible would at least keep open the possibility of extending it later, but I think that we can and should do better in the long term.
