On Fri, Mar 29, 2024 at 9:42 AM Andrew Dunstan <and...@dunslane.net> wrote:

> Here's a new set of patches, with I think everything except the error case
> Asserts attended to. There's a change to add semantic processing to the test
> suite in patch 4, but I'd fold that into patch 1 when committing.
Thanks! 0004 did highlight one last bug for me -- the return value from
semantic callbacks is ignored, so applications can't interrupt the parsing
early:

> +	if (ostart != NULL)
> +		(*ostart) (sem->semstate);
> +	inc_lex_level(lex);

Anything other than JSON_SUCCESS should be returned immediately, I think, for
this and the other callback invocations.

> +	/* make sure we've used all the input */
> +	if (lex->token_terminator - lex->token_start != ptok->len)
> +		return JSON_INVALID_TOKEN;

nit: Debugging this, if anyone ever hits it, is going to be confusing. The
error message will claim that a valid token is invalid, as opposed to saying
that the parser has a bug. Short of introducing something like
JSON_INTERNAL_ERROR, maybe an Assert() should be added at least?

> +		case JSON_NESTING_TOO_DEEP:
> +			return (_("Json nested too deep, maximum permitted depth is 6400"));

nit: other error messages use all-caps, "JSON".

> +	char		chunk_start[100],
> +				chunk_end[100];
> +
> +	snprintf(chunk_start, 100, "%s", ib->buf.data);
> +	snprintf(chunk_end, 100, "%s", ib->buf.data + (ib->buf.len - (MIN_CHUNK + 99)));
> +	elog(NOTICE, "incremental manifest:\nchunk_start='%s',\nchunk_end='%s'", chunk_start, chunk_end);

Is this NOTICE intended to be part of the final commit?

> +	inc_state = json_parse_manifest_incremental_init(&context);
> +
> +	buffer = pg_malloc(chunk_size + 64);

These `+ 64`s (there are two of them) seem to be left over from debugging; in
v7 they were just `+ 1`.

--

From this point onward, I think all of my feedback is around maintenance
characteristics, which is firmly in YMMV territory, and I don't intend to hold
up a v1 of the feature for it since I'm not the maintainer. :D

The complexity around the checksum handling and "final chunk"-ing is
unfortunate, but not a fault of this patch -- just a weird consequence of the
layering violation in the format, where the checksum is inside the object
being checksummed. I don't like it, but I don't think there's much to be done
about it.
By far the trickiest part of the implementation is the partial token logic,
because of all the new ways there are to handle different types of failures. I
think any changes to the incremental handling in json_lex() are going to need
intense scrutiny from someone who already has the mental model loaded up. I'm
going snowblind on the patch and I'm no longer the right person to judge how
hard it is to get up to speed with the architecture, but I'd say it's not
easy.

For something as security-critical as a JSON parser, I'd usually want to
counteract that by sinking the observable behaviors in concrete and getting
both the effective test coverage *and* the code coverage as close to 100% as
we can stand. (By that, I mean that purposely introducing observable bugs into
the parser should result in test failures. We're not there yet when it comes
to the semantic callbacks, at least.) But the current JSON parser isn't held
to that standard, so it seems strange for me to ask for it here. I think it'd
be good for a v1.x of this feature to focus on simplifying the code, and
hopefully consolidating and/or eliminating some of the duplicated parsing work
so that the mental model isn't quite so big.

Thanks!
--Jacob