On Monday, 17 August 2015 at 22:21:50 UTC, Andrei Alexandrescu wrote:
* stdx.data.json.generator: I think the API for converting in-memory JSON values to strings needs to be redone, as follows:

- JSONValue should offer a byToken range, which offers the contents of the value one token at a time. For example, "[ 1, 2, 3 ]" offers the '[' token followed by three numeric tokens with the respective values followed by the ']' token.

For iterating tree-like structures, a callback-based approach seems nicer, because it can naturally use the call stack for storing its state. (I assume std.concurrency.Generator is too heavy-weight for this case.)
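
For illustration, a minimal sketch of what I mean, with a hypothetical Node type standing in for JSONValue:

```d
import std.conv : to;

// Hypothetical minimal value type standing in for JSONValue, for illustration.
struct Node
{
    enum Kind { number, array }
    Kind kind;
    double num;
    Node[] elems;
}

// Callback-based token emission: recursion keeps the traversal state on the
// call stack, so no explicit state machine, fiber or Generator is needed.
void emit(in Node n, scope void delegate(const(char)[] tok) sink)
{
    final switch (n.kind)
    {
        case Node.Kind.number:
            sink(n.num.to!string);
            break;
        case Node.Kind.array:
            sink("[");
            foreach (i, ref e; n.elems)
            {
                if (i) sink(",");
                emit(e, sink);
            }
            sink("]");
            break;
    }
}

unittest
{
    string output;
    auto arr = Node(Node.Kind.array, 0,
        [Node(Node.Kind.number, 1), Node(Node.Kind.number, 2), Node(Node.Kind.number, 3)]);
    emit(arr, (tok) { output ~= tok; });
    assert(output == "[1,2,3]");
}
```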


- On top of byToken it's immediate to implement a method (say toJSON or toString) that accepts an output range of characters and formatting options.

If there really needs to be a range, `joiner` and `copy` should do the job.
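
Something along these lines (just a sketch; the plain array of slices stands in for a byToken range):

```d
import std.algorithm.iteration : joiner;
import std.algorithm.mutation : copy;
import std.array : appender;

unittest
{
    auto tokens = ["[", "1", ",", "2", ",", "3", "]"];  // stand-in for byToken
    auto app = appender!string();   // any output range of characters works here
    tokens.joiner.copy(app);        // flatten the token slices and drain them
    assert(app.data == "[1,2,3]");
}
```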


- On top of the method above with output range, implementing a toString overload that returns a string for convenience is a two-liner. However, it shouldn't return a "string"; Phobos APIs should avoid "hardcoding" the string type. Instead, it should return a user-chosen string type (including reference counting strings).

`to!string`, for compatibility with std.conv.
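
For the caller-chosen result type, something like this would do (a sketch; toJSONString is an illustrative name, not the actual API):

```d
import std.array : appender;
import std.traits : isSomeString;

// The caller picks the string type; `string` stays the convenient default.
S toJSONString(S = string, R)(R tokens)
    if (isSomeString!S)
{
    auto app = appender!S();
    foreach (tok; tokens)
        app.put(tok);
    return app.data;
}

unittest
{
    assert(toJSONString(["[", "1", "]"]) == "[1]");
    assert(toJSONString!wstring(["[", "1", "]"]) == "[1]"w);
}
```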


- While at it make prettification a flag in the options, not its own part of the function name.

(That's already done.)


* stdx.data.json.lexer:

- I assume the idea was to accept ranges of integrals to mean "there's some raw input from a file". This seems to be a bit overdone, e.g. there's no need to accept signed integers or 64-bit integers. I suggest just going with the three character types.

- I see tokenization accepts input ranges. This forces the tokenizer to store its own copy of things, which is no doubt the business of appenderFactory. Here the departure of the current approach from what I think should become canonical Phobos APIs deepens for multiple reasons. First, appenderFactory does allow customization of the append operation (nice) but that's not enough to allow the user to customize the lifetime of the created strings, which is usually reflected in the string type itself. So the lexing method should be parameterized by the string type used. (By default string (as is now) should be fine.) Therefore instead of customizing the append method just customize the string type used in the token.

- The lexer should internally take optimization opportunities, e.g. if the string type is "string" and the lexed type is also "string", great, just use slices of the input instead of appending them to the tokens.

- As a consequence the JSONToken type also needs to be parameterized by the type of its string that holds the payload. I understand this is a complication compared to the current approach, but I don't see an out. In the grand scheme of things it seems a necessary evil: tokens may or may not need a means to manage lifetime of their payload, and that's determined by the type of the payload. Hopefully simplifications in other areas of the API would offset this.

I've never seen JSON encoded in anything other than UTF-8. Is it really necessary to complicate everything for such an infrequent niche case?
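
For reference, the parameterization proposed above would look roughly like this (Token and sliceToken are illustrative names, not the actual stdx.data.json API), including the slicing opportunity when the input is itself a string:

```d
import std.traits : isSomeString;

// Illustrative only: the payload type is a template parameter, so the caller
// decides how string lifetimes are managed; `string` remains the default.
struct Token(String = string)
{
    enum Kind { str, number, other }
    Kind kind;
    String payload;   // raw, unparsed text of the token
}

// The slicing opportunity: when the lexer is instantiated with the input's own
// string type, the payload is just a slice of the input, no copying needed.
Token!S sliceToken(S)(S input, size_t start, size_t end)
    if (isSomeString!S)
{
    return Token!S(Token!S.Kind.other, input[start .. end]);
}

unittest
{
    string input = `[1,2,3]`;
    auto t = sliceToken(input, 1, 2);
    assert(t.payload == "1");
    assert(t.payload.ptr == input.ptr + 1);  // a slice of the input, not a copy
}
```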


- At token level there should be no number parsing. Just store the payload with the token and leave it for later. Very often numbers are converted when there is no need to, and the conversion is costly. This also nicely sidesteps the entire matter of bigints, floating point etc. at this level.

- Also, at token level strings should be stored with escapes unresolved. If the user wants a string with the escapes resolved, a lazy range does it.

This was already suggested, and it looks like a good idea, though there was an objection because of possible performance costs. The other objection, that it requires an allocation, is no longer valid if sliceable input is used.
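
A sketch of such a lazy escape-resolving range over a sliceable payload (only a handful of escapes handled, purely for illustration):

```d
// Lazily resolves escapes in a raw JSON string payload (quotes already
// stripped). No allocation: it just walks the slice.
struct Unescape
{
    string raw;

    @property bool empty() const { return raw.length == 0; }

    @property char front() const
    {
        if (raw[0] != '\\')
            return raw[0];
        switch (raw[1])
        {
            case 'n':  return '\n';
            case 't':  return '\t';
            case '"':  return '"';
            case '\\': return '\\';
            default:   return raw[1]; // \uXXXX etc. omitted in this sketch
        }
    }

    void popFront()
    {
        raw = raw[0] == '\\' ? raw[2 .. $] : raw[1 .. $];
    }
}

unittest
{
    import std.algorithm.comparison : equal;
    assert(Unescape(`a\nb`).equal("a\nb"));
}
```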


- Validating UTF is tricky; I've seen some discussion in this thread about it. On the face of it, JSON only accepts valid UTF characters. As such, a modularity-based argument is to pipe UTF validation before tokenization. (We need a lazy UTF validator and sanitizer stat!) An efficiency-based argument is to do validation during tokenization. I'm inclining in favor of modularization, which allows us to focus on one thing at a time and do it well, instead of duplicating validation everywhere. Note that it's easy to write routines that do JSON tokenization and leave UTF validation for later, so there's a lot of flexibility in composing validation with JSONization.

Well, in an ideal world, there should be no difference in performance between manually combined tokenization/validation and composed ranges. We should practice what we preach here.
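
For what it's worth, a lazy validator over string input doesn't take much code; a sketch (re-decoding on every front access is wasteful, but it keeps the example short):

```d
import std.utf : decode;   // throws UTFException on invalid sequences

// Lazy UTF-8 validating pass, assuming string input: each element decodes one
// code point, so bad input surfaces as an exception exactly where the consumer
// (e.g. the tokenizer) touches it.
struct LazyValidate
{
    string input;
    size_t pos;

    @property bool empty() const { return pos >= input.length; }

    @property dchar front() const
    {
        size_t i = pos;
        return decode(input, i);
    }

    void popFront()
    {
        decode(input, pos);   // advances pos past the code point
    }
}

unittest
{
    import std.algorithm.comparison : equal;
    assert(LazyValidate(`{"a":1}`).equal(`{"a":1}`));
}
```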

* stdx.data.json.parser:

- FWIW I think the whole thing with accommodating BigInt etc. is an exaggeration. Just stick with long and double.

Or, as above, leave it to the end user and provide a `to(T)` method that can support built-in types and `BigInt` alike.
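
Roughly like this (a sketch; NumberToken is an illustrative name):

```d
import std.bigint : BigInt;
static import std.conv;

// The token keeps the number as the raw slice and the user converts on
// demand, so long, double and BigInt are all handled without the lexer
// committing to any of them.
struct NumberToken
{
    string raw;

    T to(T)() const
    {
        return std.conv.to!T(raw);
    }
}

unittest
{
    assert(NumberToken("42").to!long == 42);
    assert(NumberToken("2.5").to!double == 2.5);
    assert(NumberToken("123456789012345678901234567890").to!BigInt
           == BigInt("123456789012345678901234567890"));
}
```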
