jagill opened a new pull request, #6643:
URL: https://github.com/apache/arrow-rs/pull/6643
# Which issue does this PR close?
Closes #6558.
# Rationale for this change
Currently, a StructArray can only be deserialized from a JSON object (e.g.
`{"a": 1, "b": "c"}`), but some services (e.g. Presto and Trino) encode ROW types
as JSON lists (e.g. `[1, "c"]`) because this is more compact and the schema is
already known.
arrow-json cannot currently deserialize these list-encoded structs.
# What changes are included in this PR?
This PR adds the ability to parse JSON lists into StructArrays when the
StructParseMode is set to ListOnly. In ListOnly mode, object-encoded structs
raise an error. ObjectOnly (the default) keeps the original parsing
behavior.
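
To make the two modes concrete, here is a minimal, self-contained sketch of the dispatch described above. It is not the arrow-json implementation: `Value` and `decode_struct` are hypothetical stand-ins, and only the `StructParseMode`, `ObjectOnly`, and `ListOnly` names come from this description. It also assumes a list-encoded struct must have exactly one element per schema field, and that a missing object key is an error.

```rust
/// Hypothetical, simplified JSON value; a stand-in for the real tape/decoder types.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    List(Vec<Value>),
    Object(Vec<(String, Value)>),
}

/// Mirrors the enum described in this PR; ObjectOnly is the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum StructParseMode {
    #[default]
    ObjectOnly,
    ListOnly,
}

/// Returns the struct's child values in schema order, or an error if the
/// JSON encoding does not match the selected mode.
fn decode_struct(
    mode: StructParseMode,
    field_names: &[&str],
    value: &Value,
) -> Result<Vec<Value>, String> {
    match (mode, value) {
        // Object encoding: look each schema field up by name.
        // (Assumption of this sketch: a missing key is an error.)
        (StructParseMode::ObjectOnly, Value::Object(pairs)) => field_names
            .iter()
            .map(|name| {
                pairs
                    .iter()
                    .find(|(key, _)| key.as_str() == *name)
                    .map(|(_, v)| v.clone())
                    .ok_or_else(|| format!("missing field '{name}'"))
            })
            .collect(),
        // List encoding: fields are positional, so the lengths must agree.
        (StructParseMode::ListOnly, Value::List(items)) => {
            if items.len() != field_names.len() {
                return Err(format!(
                    "expected {} struct fields, found {}",
                    field_names.len(),
                    items.len()
                ));
            }
            Ok(items.clone())
        }
        // Any other encoding is rejected for the selected mode.
        (StructParseMode::ObjectOnly, _) => Err("expected a JSON object for struct".to_string()),
        (StructParseMode::ListOnly, _) => Err("expected a JSON list for struct".to_string()),
    }
}
```

With fields `["a", "b"]`, the list `[1, "c"]` decodes only in `ListOnly` mode, and the object `{"a": 1, "b": "c"}` only in `ObjectOnly` mode. The opposite pairing returns an error in both cases.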
# Are there any user-facing changes?
Users may set the `StructParseMode` enum to `ListOnly` to parse list-style
structs. The associated enum, variants, and method have been documented. I'm
happy to update any other documentation.
# Discussion topics
1. I've made a StructParseMode enum instead of a bool flag for two reasons.
One is that it's self-descriptive (what would `true` mean?), and the other is
that it allows a future Mixed mode that could deserialize either encoding. The
latter isn't currently requested by anyone.
2. I kept the error messages as similar to the old messages as possible. I
considered having more specific error messages (like "Encountered a '[' when
parsing a Struct, but the StructParseMode is ObjectOnly" or similar), but
wanted to hear opinions before I went that route.
3. I'm not attached to any name/code-style/etc, so happy to modify to fit
local conventions.
4. One requirement was that benchmarks do not regress. My benchmark runs
have been inconclusive (see
https://gist.github.com/jagill/6749248171a1f12fb7c653ff70c5ed42). There are
often small regressions or improvements in the single-digit % range whenever I
switch between master and this PR. I suspect these are statistical noise, but I
wanted to note them.
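
For discussion topic 2, the more specific error message could be built from the mode and the offending character, roughly as sketched below. The function name `struct_mode_error` is hypothetical and only illustrates the proposed wording; it is not code from this PR.

```rust
/// Same variants as the enum described in this PR; Debug is derived so the
/// variant name can be printed in the message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StructParseMode {
    ObjectOnly,
    ListOnly,
}

/// Hypothetical helper: formats a mode-aware error for an unexpected
/// opening character (e.g. '[' in ObjectOnly mode, '{' in ListOnly mode).
fn struct_mode_error(mode: StructParseMode, found: char) -> String {
    format!("Encountered a '{found}' when parsing a Struct, but the StructParseMode is {mode:?}")
}
```

The trade-off is that messages like this diverge from the existing wording, which callers or tests may match on.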