On Monday, 17 August 2015 at 22:21:50 UTC, Andrei Alexandrescu
wrote:
* stdx.data.json.generator: I think the API for converting
in-memory JSON values to strings needs to be redone, as follows:
- JSONValue should offer a byToken range, which offers the
contents of the value one token at a time. For example, "[ 1,
2, 3 ]" offers the '[' token followed by three numeric tokens
with the respective values followed by the ']' token.
For iterating tree-like structures, a callback-based approach seems nicer,
because it can naturally use the stack for storing its state. (I
assume std.concurrency.Generator is too heavy-weight for this
case.)
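
To make the stack point concrete, here is a minimal sketch of a
recursive emitter (the `JSONValue` members used are placeholders,
not the actual API):

```d
import std.range.primitives : isOutputRange, put;

// Hedged sketch only: `isArray`, `array` and `toString` are placeholder
// names. A recursive emitter keeps its traversal state (the nesting)
// on the call stack, which a byToken range would have to store explicitly.
void emitTokens(JSONValue, Sink)(in JSONValue value, ref Sink sink)
    if (isOutputRange!(Sink, char))
{
    if (value.isArray)
    {
        put(sink, '[');
        foreach (i, ref elem; value.array)
        {
            if (i != 0) put(sink, ',');
            emitTokens(elem, sink);  // recursion == implicit stack of iterators
        }
        put(sink, ']');
    }
    else
        put(sink, value.toString()); // leaf: number, string, bool, null
}
```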
- On top of byToken it's immediate to implement a method (say
toJSON or toString) that accepts an output range of characters
and formatting options.
If there really needs to be a range, `joiner` and `copy` should
do the job.
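
Roughly like this (a sketch, assuming a `byToken` range whose tokens
have a textual form via `to!string`; both names are assumptions here):

```d
import std.algorithm.iteration : joiner, map;
import std.algorithm.mutation : copy;
import std.conv : to;

// Hedged sketch: build the character stream from the token range and
// let `copy` drive it into any output range; no bespoke loop needed.
void writeJSON(JSONValue, Output)(JSONValue value, ref Output output)
{
    value.byToken                 // assumed token range
         .map!(t => t.to!string)  // textual form of each token (assumed)
         .joiner                  // flatten to a range of characters
         .copy(output);           // drain into the caller's sink
}
```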
- On top of the method above with output range, implementing a
toString overload that returns a string for convenience is a
two-liner. However, it shouldn't return a "string"; Phobos APIs
should avoid "hardcoding" the string type. Instead, it should
return a user-chosen string type (including reference counting
strings).
`to!string`, for compatibility with std.conv.
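
The convenience layer can then stay a two-liner while letting the
caller pick the result type; a sketch (`toJSONString` and `writeJSON`
are placeholder names):

```d
import std.array : appender;
import std.traits : isSomeString;

// Hedged sketch: a thin wrapper over the output-range overload sketched
// above; the caller picks the result type, `string` stays the default.
S toJSONString(S = string, JSONValue)(JSONValue value)
    if (isSomeString!S)
{
    auto app = appender!S();
    value.writeJSON(app);  // output-range based overload from earlier
    return app.data;
}
```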
  - While at it make prettification a flag in the options, not
its own part of the function name.
(That's already done.)
* stdx.data.json.lexer:
- I assume the idea was to accept ranges of integrals to mean
"there's some raw input from a file". This seems to be a bit
overdone, e.g. there's no need to accept signed integers or
64-bit integers. I suggest just going with the three character
types.
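
For illustration, restricting the input to the three character types
is a one-line constraint (a sketch; `isCharInput` is a made-up name):

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : isSomeChar;

// Hedged sketch: accept only ranges of char/wchar/dchar, not arbitrary
// integrals.
enum isCharInput(R) = isInputRange!R && isSomeChar!(ElementType!R);

static assert( isCharInput!string);
static assert( isCharInput!(wchar[]));
static assert( isCharInput!(dchar[]));
static assert(!isCharInput!(ubyte[]));  // raw bytes
static assert(!isCharInput!(long[]));   // 64-bit integers
```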
- I see tokenization accepts input ranges. This forces the
tokenizer to store its own copy of things, which is no doubt
the business of appenderFactory. Here the departure of the
current approach from what I think should become canonical
Phobos APIs deepens for multiple reasons. First,
appenderFactory does allow customization of the append
operation (nice) but that's not enough to allow the user to
customize the lifetime of the created strings, which is usually
reflected in the string type itself. So the lexing method
should be parameterized by the string type used. (By default
string (as is now) should be fine.) Therefore instead of
customizing the append method just customize the string type
used in the token.
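
Concretely, the entry point could be parameterized like this (a
sketch; `lexJSON` and `JSONLexerRange` are placeholder names, and
`RCString` is a hypothetical ref-counted string used only to
illustrate the lifetime point):

```d
import std.traits : isSomeString;

// Hedged sketch: the string type of the token payloads becomes a template
// parameter of the lexer, with `string` as the default; the appender
// customization disappears. isCharInput is the constraint sketched earlier.
auto lexJSON(String = string, Input)(Input input)
    if (isCharInput!Input && isSomeString!String)
{
    return JSONLexerRange!(String, Input)(input);  // placeholder range type
}

// Callers choose the payload lifetime through the type:
//    auto gcTokens = lexJSON(input);           // GC-managed string payloads
//    auto rcTokens = lexJSON!RCString(input);  // hypothetical ref-counted string
```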
- The lexer should internally take optimization opportunities,
e.g. if the string type is "string" and the lexed type is also
"string", great, just use slices of the input instead of
appending them to the tokens.
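
The gist of that optimization, as a hedged sketch (`stringPayload` is
a made-up helper and the positions are code-unit indices):

```d
import std.conv : to;
import std.traits : isSomeString;

// Hedged sketch: when the input already has the requested payload type,
// a token's string payload can simply be a slice of the input; otherwise
// it is copied or transcoded.
String stringPayload(String, Input)(Input input, size_t begin, size_t end)
    if (isSomeString!String && isSomeString!Input)
{
    static if (is(Input : String))
        return input[begin .. end];            // zero-copy: just a slice
    else
        return input[begin .. end].to!String;  // copy or transcode
}

unittest
{
    string src = `{"key":"value"}`;
    assert(stringPayload!string(src, 2, 5) is src[2 .. 5]);  // shares memory
    assert(stringPayload!wstring(src, 2, 5) == "key"w);      // transcoded copy
}
```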
- As a consequence the JSONToken type also needs to be
parameterized by the type of its string that holds the payload.
I understand this is a complication compared to the current
approach, but I don't see an out. In the grand scheme of things
it seems a necessary evil: tokens may or may not need a means
to manage lifetime of their payload, and that's determined by
the type of the payload. Hopefully simplifications in other
areas of the API would offset this.
I've never seen JSON encoded in anything other than UTF-8. Is it
really necessary to complicate everything for such an infrequent
niche case?
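
For what it's worth, the parameterization itself stays small;
something along these lines (names are illustrative, not the current
API):

```d
// Hedged sketch: the token is parameterized by its payload type and
// only carries the raw text of the token.
enum JSONTokenKind
{
    objectStart, objectEnd, arrayStart, arrayEnd,
    colon, comma, string_, number, true_, false_, null_
}

struct JSONToken(String = string)
{
    JSONTokenKind kind;
    String payload;  // slice or copy, depending on String and the input
}

alias GCToken = JSONToken!string;  // today's behavior as the default
```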
- At token level there should be no number parsing. Just store
the payload with the token and leave it for later. Very often
numbers are converted without there being a need, and the
process is costly. This also nicely sidesteps the entire matter
of bigints, floating point etc. at this level.
- Also, at token level strings should be stored with escapes
unresolved. If the user wants a string with the escapes
resolved, a lazy range does it.
This was already suggested, and it looks like a good idea, though
there was an objection because of possible performance costs. The
other objection, that it requires an allocation, is no longer
valid if sliceable input is used.
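
A simplified sketch of such a lazy range, handling only the
single-character escapes and assuming well-formed input (\uXXXX is
omitted for brevity):

```d
import std.range.primitives : empty, front, isInputRange, popFront;

// Hedged, simplified sketch of lazy escape resolution; \uXXXX sequences
// and error handling are left out to keep it short.
struct Unescaped(R) if (isInputRange!R)
{
    private R source;
    private dchar current;
    private bool done;

    this(R source) { this.source = source; advance(); }

    @property bool empty() const { return done; }
    @property dchar front() const { return current; }
    void popFront() { advance(); }

    private void advance()
    {
        if (source.empty) { done = true; return; }
        dchar c = source.front; source.popFront();
        if (c != '\\') { current = c; return; }
        dchar e = source.front; source.popFront();  // assumes no trailing '\'
        switch (e)
        {
            case 'n': current = '\n'; break;
            case 't': current = '\t'; break;
            case 'r': current = '\r'; break;
            case 'b': current = '\b'; break;
            case 'f': current = '\f'; break;
            default:  current = e;    break;  // '"', '\\' and '/' map to themselves
        }
    }
}

auto unescaped(R)(R raw) { return Unescaped!R(raw); }

unittest
{
    import std.algorithm.comparison : equal;
    assert(`a\tb\n\"quoted\"`.unescaped.equal("a\tb\n\"quoted\""));
}
```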
- Validating UTF is tricky; I've seen some discussion in this
thread about it. On the face of it JSON only accepts valid UTF
characters. As such, a modularity-based argument is to pipe UTF
validation before tokenization. (We need a lazy UTF validator
and sanitizer stat!) An efficiency-based argument is to do
validation during tokenization. I'm inclining in favor of
modularization, which allows us to focus on one thing at a time
and do it well, instead of duplicating validation everywhere.
Note that it's easy to write routines that do JSON tokenization
and leave UTF validation for later, so there's a lot of
flexibility in composing validation with JSONization.
Well, in an ideal world, there should be no difference in
performance between manually combined tokenization/validation
and composed ranges. We should practice what we preach here.
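
In the modular version the composition reads as a plain chain; a
sketch, with `lexJSON` standing in for the tokenizer and
std.utf.byDchar used as the closest existing thing to a lazy
sanitizer (it substitutes U+FFFD for invalid sequences instead of
throwing):

```d
import std.utf : byDchar;

// Hedged sketch: UTF handling piped in front of tokenization rather
// than baked into the lexer; `lexJSON` is a placeholder.
auto tokenizeSanitized(Input)(Input raw)
{
    return raw.byDchar.lexJSON;  // the tokenizer only sees valid code points
}
```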
* stdx.data.json.parser:
- FWIW I think the whole thing with accommodating BigInt etc.
is an exaggeration. Just stick with long and double.
Or, as above, leave it to the end user and provide a `to(T)`
method that can support built-in types and `BigInt` alike.
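
A sketch of what that deferred conversion could look like (the
`JSONNumber` wrapper is illustrative only):

```d
import std.bigint : BigInt;
import std.conv;

// Hedged sketch: the parser keeps the digits as text and the caller
// decides the target type, so BigInt support costs nothing up front.
struct JSONNumber
{
    string repr;                                     // unparsed digits from the token
    T to(T)() const { return std.conv.to!T(repr); }  // convert only on demand
}

unittest
{
    assert(JSONNumber("42").to!long == 42);
    assert(JSONNumber("2.5").to!double == 2.5);
    assert(JSONNumber("18446744073709551616").to!BigInt  // 2^64: too big for long
           == BigInt("18446744073709551616"));
}
```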