On Fri, Mar 29, 2024 at 9:42 AM Andrew Dunstan <and...@dunslane.net> wrote:
> Here's a new set of patches, with I think everything except the error case 
> Asserts attended to. There's a change to add semantic processing to the test 
> suite in patch 4, but I'd fold that into patch 1 when committing.

Thanks! 0004 did highlight one last bug for me -- the return value
from semantic callbacks is ignored, so applications can't interrupt
the parsing early:

> +                       if (ostart != NULL)
> +                           (*ostart) (sem->semstate);
> +                       inc_lex_level(lex);

Anything not JSON_SUCCESS should be returned immediately, I think, for
this and the other calls.
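
A minimal sketch of what I mean, assuming the callbacks return
JsonParseErrorType as elsewhere in the patch (`result` being a local
of that type):

    if (ostart != NULL)
    {
        result = (*ostart) (sem->semstate);
        if (result != JSON_SUCCESS)
            return result;    /* surface the callback's failure/early exit */
    }
    inc_lex_level(lex);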

> +           /* make sure we've used all the input */
> +           if (lex->token_terminator - lex->token_start != ptok->len)
> +               return JSON_INVALID_TOKEN;

nit: Debugging this, if anyone ever hits it, is going to be confusing.
The error message is going to claim that a valid token is invalid, as
opposed to saying that the parser has a bug. Short of introducing
something like JSON_INTERNAL_ERROR, maybe an Assert() should be added
at least?
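
Something like this, maybe (untested sketch):

    /*
     * A mismatch here means the partial-token reassembly has a bug;
     * it does not mean the input was bad. Trip an assert in debug
     * builds rather than silently blaming the token.
     */
    Assert(lex->token_terminator - lex->token_start == ptok->len);

    /* make sure we've used all the input */
    if (lex->token_terminator - lex->token_start != ptok->len)
        return JSON_INVALID_TOKEN;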

> +       case JSON_NESTING_TOO_DEEP:
> +           return(_("Json nested too deep, maximum permitted depth is 6400"));

nit: other error messages use all-caps, "JSON"

> +       char        chunk_start[100],
> +                   chunk_end[100];
> +
> +       snprintf(chunk_start, 100, "%s", ib->buf.data);
> +       snprintf(chunk_end, 100, "%s", ib->buf.data + (ib->buf.len - (MIN_CHUNK + 99)));
> +       elog(NOTICE, "incremental manifest:\nchunk_start='%s',\nchunk_end='%s'", chunk_start, chunk_end);

Is this NOTICE intended to be part of the final commit?

> +       inc_state = json_parse_manifest_incremental_init(&context);
> +
> +       buffer = pg_malloc(chunk_size + 64);

These `+ 64`s (there are two of them) seem to be left over from
debugging. In v7 they were just `+ 1`.
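
i.e. I'd expect these to go back to something like

    buffer = pg_malloc(chunk_size + 1);

(assuming nothing else has started to rely on the extra slop).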

--

From this point onward, I think all of my feedback is around
maintenance characteristics, which is firmly in YMMV territory, and I
don't intend to hold up a v1 of the feature for it since I'm not the
maintainer. :D

The complexity around the checksum handling and "final chunk"-ing is
unfortunate, but not a fault of this patch -- just a weird consequence
of the layering violation in the format, where the checksum is inside
the object being checksummed. I don't like it, but I don't think
there's much to be done about it.

By far the trickiest part of the implementation is the partial token
logic, because of all the new ways there are to handle different types
of failures. I think any changes to the incremental handling in
json_lex() are going to need intense scrutiny from someone who already
has the mental model loaded up. I'm going snowblind on the patch and
I'm no longer the right person to review how hard it is to get up to
speed with the architecture, but I'd say it's not easy.

For something as security-critical as a JSON parser, I'd usually want
to counteract that by sinking the observable behaviors in concrete and
getting both the effective test coverage *and* the code coverage as
close to 100% as we can stand. (By that, I mean that purposely
introducing observable bugs into the parser should result in test
failures. We're not there yet when it comes to the semantic callbacks,
at least.) But I don't think the current JSON parser is held to that
standard, so it seems strange for me to ask for that here.

I think it'd be good for a v1.x of this feature to focus on
simplification of the code, and hopefully consolidate and/or eliminate
some of the duplicated parsing work so that the mental model isn't
quite so big.

Thanks!
--Jacob

