On Monday, 17 August 2015 at 22:21:50 UTC, Andrei Alexandrescu
wrote:
* stdx.data.json.generator: I think the API for converting
in-memory JSON values to strings needs to be redone, as follows:
- JSONValue should offer a byToken range, which offers the
contents of the value one token at a time. For example, "[ 1,
2, 3 ]" offers the '[' token followed by three numeric tokens
with the respective values followed by the ']' token.
For iterating tree-like structures, a callback-based approach seems nicer,
because it can naturally use the stack for storing its state. (I
assume std.concurrency.Generator is too heavy-weight for this
case.)
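
To make the stack point concrete, here is a minimal sketch of a
recursive emitter (the `JSONValue` members used are placeholders,
not the actual API):

```d
import std.range.primitives : isOutputRange, put;

// Hedged sketch only: `isArray`, `array` and `toString` are placeholder
// names. A recursive emitter keeps its traversal state (the nesting)
// on the call stack, which a byToken range would have to store explicitly.
void emitTokens(JSONValue, Sink)(in JSONValue value, ref Sink sink)
    if (isOutputRange!(Sink, char))
{
    if (value.isArray)
    {
        put(sink, '[');
        foreach (i, ref elem; value.array)
        {
            if (i != 0) put(sink, ',');
            emitTokens(elem, sink);  // recursion == implicit stack of iterators
        }
        put(sink, ']');
    }
    else
        put(sink, value.toString()); // leaf: number, string, bool, null
}
```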
- On top of byToken it's immediate to implement a method (say
toJSON or toString) that accepts an output range of characters
and formatting options.
If there really needs to be a range, `joiner` and `copy` should
do the job.
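
Roughly like this (a sketch, assuming a `byToken` range whose tokens
have a textual form via `to!string`; both names are assumptions here):

```d
import std.algorithm.iteration : joiner, map;
import std.algorithm.mutation : copy;
import std.conv : to;

// Hedged sketch: build the character stream from the token range and
// let `copy` drive it into any output range; no bespoke loop needed.
void writeJSON(JSONValue, Output)(JSONValue value, ref Output output)
{
    value.byToken                 // assumed token range
         .map!(t => t.to!string)  // textual form of each token (assumed)
         .joiner                  // flatten to a range of characters
         .copy(output);           // drain into the caller's sink
}
```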
- On top of the method above with output range, implementing a
toString overload that returns a string for convenience is a
two-liner. However, it shouldn't return a "string"; Phobos APIs
should avoid "hardcoding" the string type. Instead, it should
return a user-chosen string type (including reference counting
strings).
`to!string`, for compatibility with std.conv.
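
The convenience layer can then stay a two-liner while letting the
caller pick the result type; a sketch (`toJSONString` and `writeJSON`
are placeholder names):

```d
import std.array : appender;
import std.traits : isSomeString;

// Hedged sketch: a thin wrapper over the output-range overload sketched
// above; the caller picks the result type, `string` stays the default.
S toJSONString(S = string, JSONValue)(JSONValue value)
    if (isSomeString!S)
{
    auto app = appender!S();
    value.writeJSON(app);  // output-range based overload from earlier
    return app.data;
}
```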
  - While at it make prettification a flag in the options, not
its own part of the function name.
(That's already done.)
* stdx.data.json.lexer:
- I assume the idea was to accept ranges of integrals to mean
"there's some raw input from a file". This seems to be a bit
overdone, e.g. there's no need to accept signed integers or
64-bit integers. I suggest just going with the three character
types.
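
For illustration, restricting the input to the three character types
is a one-line constraint (a sketch; `isCharInput` is a made-up name):

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : isSomeChar;

// Hedged sketch: accept only ranges of char/wchar/dchar, not arbitrary
// integrals.
enum isCharInput(R) = isInputRange!R && isSomeChar!(ElementType!R);

static assert( isCharInput!string);
static assert( isCharInput!(wchar[]));
static assert( isCharInput!(dchar[]));
static assert(!isCharInput!(ubyte[]));  // raw bytes
static assert(!isCharInput!(long[]));   // 64-bit integers
```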
- I see tokenization accepts input ranges. This forces the
tokenizer to store its own copy of things, which is no doubt
the business of appenderFactory. Here the departure of the
current approach from what I think should become canonical
Phobos APIs deepens for multiple reasons. First,
appenderFactory does allow customization of the append
operation (nice) but that's not enough to allow the user to
customize the lifetime of the created strings, which is usually
reflected in the string type itself. So the lexing method
should be parameterized by the string type used. (By default
string (as is now) should be fine.) Therefore instead of
customizing the append method just customize the string type
used in the token.
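
Concretely, the entry point could be parameterized like this (a
sketch; `lexJSON` and `JSONLexerRange` are placeholder names, and
`RCString` is a hypothetical ref-counted string used only to
illustrate the lifetime point):

```d
import std.traits : isSomeString;

// Hedged sketch: the string type of the token payloads becomes a template
// parameter of the lexer, with `string` as the default; the appender
// customization disappears. isCharInput is the constraint sketched earlier.
auto lexJSON(String = string, Input)(Input input)
    if (isCharInput!Input && isSomeString!String)
{
    return JSONLexerRange!(String, Input)(input);  // placeholder range type
}

// Callers choose the payload lifetime through the type:
//    auto gcTokens = lexJSON(input);           // GC-managed string payloads
//    auto rcTokens = lexJSON!RCString(input);  // hypothetical ref-counted string
```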
- The lexer should internally take optimization opportunities,
e.g. if the string type is "string" and the lexed type is also
"string", great, just use slices of the input instead of
appending them to the tokens.
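
The gist of that optimization, as a hedged sketch (`stringPayload` is
a made-up helper and the positions are code-unit indices):

```d
import std.conv : to;
import std.traits : isSomeString;

// Hedged sketch: when the input already has the requested payload type,
// a token's string payload can simply be a slice of the input; otherwise
// it is copied or transcoded.
String stringPayload(String, Input)(Input input, size_t begin, size_t end)
    if (isSomeString!String && isSomeString!Input)
{
    static if (is(Input : String))
        return input[begin .. end];            // zero-copy: just a slice
    else
        return input[begin .. end].to!String;  // copy or transcode
}

unittest
{
    string src = `{"key":"value"}`;
    assert(stringPayload!string(src, 2, 5) is src[2 .. 5]);  // shares memory
    assert(stringPayload!wstring(src, 2, 5) == "key"w);      // transcoded copy
}
```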
- As a consequence the JSONToken type also needs to be
parameterized by the type of its string that holds the payload.
I understand this is a complication compared to the current
approach, but I don't see an out. In the grand scheme of things
it seems a necessary evil: tokens may or may not need a means
to manage lifetime of their payload, and that's determined by
the type of the payload. Hopefully simplifications in other
areas of the API would offset this.
I've never seen JSON encoded in anything other than UTF-8. Is it
really necessary to complicate everything for such an infrequent
niche case?
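
For what it's worth, the parameterization itself stays small;
something along these lines (names are illustrative, not the current
API):

```d
// Hedged sketch: the token is parameterized by its payload type and
// only carries the raw text of the token.
enum JSONTokenKind
{
    objectStart, objectEnd, arrayStart, arrayEnd,
    colon, comma, string_, number, true_, false_, null_
}

struct JSONToken(String = string)
{
    JSONTokenKind kind;
    String payload;  // slice or copy, depending on String and the input
}

alias GCToken = JSONToken!string;  // today's behavior as the default
```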
- At token level there should be no number parsing. Just store
the payload with the token and leave it for later. Very often
numbers are converted without there being a need, and the
process is costly. This also nicely sidesteps the entire matter
of bigints, floating point etc. at this level.
- Also, at token level strings should be stored with escapes
unresolved. If the user wants a string with the escapes
resolved, a lazy range does it.
This was already suggested, and it looks like a good idea, though
there was an objection because of possible performance costs. The
other objection, that it requires an allocation, is no longer
valid if sliceable input is used.
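
A simplified sketch of such a lazy range, handling only the
single-character escapes and assuming well-formed input (\uXXXX is
omitted for brevity):

```d
import std.range.primitives : empty, front, isInputRange, popFront;

// Hedged, simplified sketch of lazy escape resolution; \uXXXX sequences
// and error handling are left out to keep it short.
struct Unescaped(R) if (isInputRange!R)
{
    private R source;
    private dchar current;
    private bool done;

    this(R source) { this.source = source; advance(); }

    @property bool empty() const { return done; }
    @property dchar front() const { return current; }
    void popFront() { advance(); }

    private void advance()
    {
        if (source.empty) { done = true; return; }
        dchar c = source.front; source.popFront();
        if (c != '\\') { current = c; return; }
        dchar e = source.front; source.popFront();  // assumes no trailing '\'
        switch (e)
        {
            case 'n': current = '\n'; break;
            case 't': current = '\t'; break;
            case 'r': current = '\r'; break;
            case 'b': current = '\b'; break;
            case 'f': current = '\f'; break;
            default:  current = e;    break;  // '"', '\\' and '/' map to themselves
        }
    }
}

auto unescaped(R)(R raw) { return Unescaped!R(raw); }

unittest
{
    import std.algorithm.comparison : equal;
    assert(`a\tb\n\"quoted\"`.unescaped.equal("a\tb\n\"quoted\""));
}
```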
- Validating UTF is tricky; I've seen some discussion in this
thread about it. On the face of it JSON only accepts valid UTF
characters. As such, a modularity-based argument is to pipe UTF
validation before tokenization. (We need a lazy UTF validator
and sanitizer stat!) An efficiency-based argument is to do
validation during tokenization. I'm inclining in favor of
modularization, which allows us to focus on one thing at a time
and do it well, instead of duplicating validation everywhere.
Note that it's easy to write routines that do JSON tokenization
and leave UTF validation for later, so there's a lot of
flexibility in composing validation with JSONization.
Well, in an ideal world, there should be no difference in
performance between manually combined tokenization/validation
and composed ranges. We should practice what we preach here.
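
In the modular version the composition reads as a plain chain; a
sketch, with `lexJSON` standing in for the tokenizer and
std.utf.byDchar used as the closest existing thing to a lazy
sanitizer (it substitutes U+FFFD for invalid sequences instead of
throwing):

```d
import std.utf : byDchar;

// Hedged sketch: UTF handling piped in front of tokenization rather
// than baked into the lexer; `lexJSON` is a placeholder.
auto tokenizeSanitized(Input)(Input raw)
{
    return raw.byDchar.lexJSON;  // the tokenizer only sees valid code points
}
```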
* stdx.data.json.parser:
- FWIW I think the whole thing with accommodating BigInt etc.
is an exaggeration. Just stick with long and double.
Or, as above, leave it to the end user and provide a `to(T)`
method that can support built-in types and `BigInt` alike.
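
A sketch of what that deferred conversion could look like (the
`JSONNumber` wrapper is illustrative only):

```d
import std.bigint : BigInt;
import std.conv;

// Hedged sketch: the parser keeps the digits as text and the caller
// decides the target type, so BigInt support costs nothing up front.
struct JSONNumber
{
    string repr;                                     // unparsed digits from the token
    T to(T)() const { return std.conv.to!T(repr); }  // convert only on demand
}

unittest
{
    assert(JSONNumber("42").to!long == 42);
    assert(JSONNumber("2.5").to!double == 2.5);
    assert(JSONNumber("18446744073709551616").to!BigInt  // 2^64: too big for long
           == BigInt("18446744073709551616"));
}
```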