scovich opened a new pull request, #9232: URL: https://github.com/apache/arrow-rs/pull/9232
# Which issue does this PR close?

- Closes https://github.com/apache/arrow-rs/issues/9204

# Rationale for this change

It's not good to tolerate obviously ill-formed JSON like `[,,, 10,,, 20,,,]`.

# What changes are included in this PR?

Reject leading and repeated commas while still tolerating at most one trailing comma, since that's a common and intuitive case.

While we're at it, optimize the tape decoder state machine to eliminate redundant decision-making. The performance gain from that optimization compensates for the performance loss from checking for commas separately.

# Are these changes tested?

Yes, new unit tests cover the expected behavior change, and benchmarking shows a moderate overall improvement in performance.

Exception: the three `xxx_hex_json` variants are very noisy and show anything from a 15% speedup to a 20% slowdown from run to run. But as far as I can tell they all tape-decode the exact same input JSON values, so any performance difference in the tape decoder should affect them equally. This leads me to conclude that those three benchmark cases are just plain unstable.

# Are there any user-facing changes?

JSON parsing now rejects ill-formed JSON it used to accept. Not sure whether this merits a documentation change?
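
For illustration only, the comma rule described in this PR (reject leading and repeated commas, tolerate at most one trailing comma) can be sketched as a small state machine. This is not the code from the PR and not the arrow-json tape decoder: `State` and `check_commas` are hypothetical names, and the sketch assumes a flat array of unsigned integers rather than general JSON.

```rust
/// Hypothetical states for the sketch; not taken from arrow-json.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    /// Just after `[`: a value or `]` may follow, a comma may not.
    Start,
    /// Just after a value: a comma or `]` may follow.
    AfterValue,
    /// Just after a comma: a value or `]` (single trailing comma) may follow.
    AfterComma,
}

/// Checks comma placement in a flat array of unsigned integers, e.g. `[10, 20]`.
/// Hypothetical helper written for this example only.
fn check_commas(input: &str) -> Result<(), String> {
    let mut chars = input.trim().chars().peekable();
    if chars.next() != Some('[') {
        return Err("expected '['".to_string());
    }
    let mut state = State::Start;
    while let Some(c) = chars.next() {
        match (state, c) {
            // Whitespace never changes the state.
            (_, w) if w.is_ascii_whitespace() => {}
            (State::Start, ',') => return Err("leading comma".to_string()),
            (State::AfterComma, ',') => return Err("repeated comma".to_string()),
            (State::AfterValue, ',') => state = State::AfterComma,
            // `]` is accepted in every state, so one trailing comma is tolerated.
            (_, ']') => {
                return if chars.next().is_none() {
                    Ok(())
                } else {
                    Err("trailing characters after ']'".to_string())
                };
            }
            // Two values in a row without a separating comma.
            (State::AfterValue, d) if d.is_ascii_digit() => {
                return Err("missing comma".to_string());
            }
            (_, d) if d.is_ascii_digit() => {
                // Consume the remaining digits of this number.
                while chars.next_if(|x| x.is_ascii_digit()).is_some() {}
                state = State::AfterValue;
            }
            (_, other) => return Err(format!("unexpected character {:?}", other)),
        }
    }
    Err("unterminated array".to_string())
}
```

Under these assumptions, `check_commas("[10, 20,]")` succeeds, while `check_commas("[,,, 10,,, 20,,,]")` fails on the first comma. Tracking the comma as its own state is what lets a second consecutive comma be told apart from a single trailing one.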
