On 2023-10-25 We 09:05, Robert Haas wrote:
On Wed, Oct 25, 2023 at 7:54 AM Andrew Dunstan <and...@dunslane.net> wrote:
Robert asked me to work on this quite some time ago, and most of this
work was done last year.

Here's my WIP for an incremental JSON parser. It works and passes all
the usual json/b tests. It implements Algorithm 4.3 in the Dragon Book.
The reason I haven't posted it before is that it's about 50% slower in
pure parsing speed than the current recursive descent parser in my
testing. I've tried various things to make it faster, but haven't made
much impact. One of my colleagues is going to take a fresh look at it,
but maybe someone on the list can see where we can save some cycles.

If we can't make it faster, I guess we could use the RD parser for
non-incremental cases and only use the non-RD parser for incremental,
although that would be a bit sad. However, I don't think we can make the
RD parser suitable for incremental parsing - there's too much state
involved in the call stack.
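
To give a feel for the shape of it, here's a toy sketch - nothing like the real patch (single-character tokens, a cut-down grammar, made-up names), just to show how the parse state can live in a struct and be picked up again when more input arrives:

/* Toy sketch only, not the actual patch: a table-driven parser for a
 * cut-down JSON-ish grammar (VALUE -> 's' | '[' LIST ']' | '{' LIST '}',
 * LIST -> VALUE LIST | empty), with all parse state in an explicit stack
 * inside a struct rather than on the C call stack. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { SYM_VALUE, SYM_LIST, SYM_TERM } SymKind;

typedef struct { SymKind kind; char term; } Sym;

typedef struct {
    Sym  stack[256];
    int  top;        /* next free slot */
    bool failed;
} ToyParser;

static void push(ToyParser *p, SymKind kind, char term)
{
    p->stack[p->top].kind = kind;
    p->stack[p->top].term = term;
    p->top++;
}

/* Consume as much of 'chunk' as possible, then just return; because no
 * state lives on the C call stack, the caller can feed the next chunk
 * whenever it arrives. */
static void toy_feed(ToyParser *p, const char *chunk)
{
    const char *cp = chunk;

    while (!p->failed && p->top > 0 && *cp != '\0')
    {
        Sym  top = p->stack[--p->top];
        char look = *cp;

        if (top.kind == SYM_TERM)
        {
            if (look == top.term)
                cp++;                       /* matched a terminal */
            else
                p->failed = true;
        }
        else if (top.kind == SYM_VALUE)
        {
            if (look == 's')
                push(p, SYM_TERM, 's');
            else if (look == '[' || look == '{')
            {
                /* push the production's right-hand side in reverse */
                push(p, SYM_TERM, look == '[' ? ']' : '}');
                push(p, SYM_LIST, 0);
                push(p, SYM_TERM, look);
            }
            else
                p->failed = true;
        }
        else                                /* SYM_LIST */
        {
            if (look == 's' || look == '[' || look == '{')
            {
                push(p, SYM_LIST, 0);       /* LIST -> VALUE LIST */
                push(p, SYM_VALUE, 0);
            }
            /* otherwise take the empty production: nothing to push */
        }
    }
}

int main(void)
{
    ToyParser p = {.top = 0};

    push(&p, SYM_VALUE, 0);     /* start symbol */
    toy_feed(&p, "[s{s");       /* first chunk ... */
    toy_feed(&p, "s}s]");       /* ... the rest arrives later */
    printf("%s\n", (!p.failed && p.top == 0) ? "ok" : "bad");
    return 0;
}
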
Yeah, this is exactly why I didn't want to use JSON for the backup
manifest in the first place. Parsing such a manifest incrementally is
complicated. If we'd gone with my original design where the manifest
consisted of a bunch of lines each of which could be parsed
separately, we'd already have incremental parsing and wouldn't be
faced with these difficult trade-offs.
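
For illustration only (the record handler below is just a stub, not anything from pg_verifybackup), the attraction of a line-per-record layout is that the read loop itself does the incremental part - each line is a complete record, so nothing has to be carried across chunk boundaries:

/* Illustration only: not the proposed manifest format and not real
 * pg_verifybackup code.  With one record per line, the read loop itself
 * is the "incremental parser": no parse state ever spans a chunk
 * boundary, so memory use is one line regardless of manifest size. */
#include <stdio.h>
#include <string.h>

static void process_record(const char *line)
{
    /* The real thing would hand each complete line to the existing
     * non-incremental JSON parser on its own; here we just report it. */
    printf("record of %zu bytes\n", strlen(line));
}

int main(void)
{
    char line[8192];

    while (fgets(line, sizeof(line), stdin) != NULL)
    {
        line[strcspn(line, "\n")] = '\0';   /* strip trailing newline */
        if (line[0] != '\0')
            process_record(line);
    }
    return 0;
}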

Unfortunately, I'm not in a good position either to figure out how to
make your prototype faster, or to evaluate how painful it is to keep
both in the source tree. It's probably worth considering how likely it
is that we'd be interested in incremental JSON parsing in other cases.
Maintaining two JSON parsers is probably not a lot of fun regardless,
but if each of them gets used for a bunch of things, that feels less
bad than if one of them gets used for a bunch of things and the other
one only ever gets used for backup manifests. Would we be interested
in JSON-format database dumps? Incrementally parsing JSON LOBs? Either
seems tenuous, but those are examples of the kind of thing that could
make us happy to have incremental JSON parsing as a general facility.

If nobody's very excited by those kinds of use cases, then this just
boils down to whether we want to (a) accept that users with very large
numbers of relation files won't be able to use pg_verifybackup or
incremental backup, (b) accept that we're going to maintain a second
JSON parser just to enable that use case, with no other benefit, or
(c) undertake to change the manifest format to something that is
straightforward to parse incrementally. I think (a) is reasonable
short term, but at some point I think we should do better. I'm not
really that enthused about (c) because it means more work for me and
possibly more arguing, but if (b) is going to cause a lot of hassle
then we might need to consider it.


I'm not too worried about the maintenance burden. The RD routines were added in March 2013 (commit a570c98d7fa) and have hardly changed since then. The new code is not ground-breaking - it's just a different (and fairly well known) way of doing the same thing. I'd be happier if we could make it faster, but maybe it's just a fact that keeping an explicit stack, which is how this works, is slower.
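
Just to illustrate what I mean about the call stack, here's the same toy grammar as the sketch above, done recursive-descent style. It's purely an invented example, not our parser, but "where we are" in the document is encoded in the chain of active stack frames, so there's no tidy way to hand control back to the caller mid-value and resume later - and there's also less per-symbol bookkeeping, which is presumably part of why RD comes out faster:

/* The same toy grammar, recursive-descent style, again invented purely
 * for comparison (it is not the in-tree JSON parser). */
#include <stdbool.h>
#include <stdio.h>

static const char *cp;      /* input cursor */
static bool failed;

static void parse_value(void);

static void parse_container(char closer)
{
    while (!failed && *cp != closer)
    {
        if (*cp == '\0')
        {
            failed = true;  /* ran out of input mid-container */
            return;
        }
        parse_value();
    }
    if (!failed)
        cp++;               /* consume ']' or '}' */
}

static void parse_value(void)
{
    if (*cp == 's')
        cp++;
    else if (*cp == '[' || *cp == '{')
    {
        char closer = (*cp == '[') ? ']' : '}';

        cp++;
        parse_container(closer);    /* nesting depth == recursion depth */
    }
    else
        failed = true;
}

int main(void)
{
    cp = "[s{ss}s]";        /* has to be the whole document up front */
    failed = false;
    parse_value();
    printf("%s\n", (!failed && *cp == '\0') ? "ok" : "bad");
    return 0;
}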

I wouldn't at all be surprised if there were other good uses for incremental JSON parsing, including some you've identified.

That said, I agree that JSON might not be the best format for backup manifests, but maybe that ship has sailed.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


