nevi-me opened a new pull request #8938:
URL: https://github.com/apache/arrow/pull/8938
Big one!
This implements a JSON nested list reader, which means that we can now read
`<struct<list<struct<_>>>>` and other variants.
While working on this, I noticed some bugs in the reader, which I fixed.
They were:
* `<list<string>>` was not read correctly by the dictionary hack
* `<list<primitive>>` was not creating the correct list offsets; sometimes
`null` was placed in the wrong logical position
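To illustrate the offsets issue, here is a minimal sketch of how Arrow-style list offsets line up with null rows. It is not the reader's actual code; `list_offsets` is a hypothetical helper written only for this example. The point is that a null row still occupies a slot, so it repeats the previous offset and is marked invalid rather than being skipped.

```rust
/// Hypothetical helper (not part of the arrow crate): builds Arrow-style
/// list offsets, a validity mask, and the flattened child values from rows
/// that may be null.
fn list_offsets<T: Clone>(rows: &[Option<Vec<T>>]) -> (Vec<i32>, Vec<bool>, Vec<T>) {
    let mut offsets = Vec::with_capacity(rows.len() + 1);
    let mut validity = Vec::with_capacity(rows.len());
    let mut values = Vec::new();

    // Offsets always start at 0 and have one more entry than there are rows.
    offsets.push(0);
    for row in rows {
        match row {
            Some(items) => {
                // A valid row appends its items to the child array.
                values.extend(items.iter().cloned());
                validity.push(true);
            }
            // A null row adds no child values; it is only marked invalid.
            None => validity.push(false),
        }
        // The end offset of this row is the child length so far, so a null
        // (or empty) row repeats the previous offset.
        offsets.push(values.len() as i32);
    }
    (offsets, validity, values)
}
```

For rows `[[1, 2], null, [], [3]]` this yields offsets `[0, 2, 2, 2, 3]`, so the null stays at logical index 1 and is distinguished from the empty list at index 2 only by the validity mask.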
I've also added a few benchmarks, where the nested list benchmark now
performs about 20% slower. I'm fine with this, as we weren't always reading
values correctly anyway.
I suspect the main perf loss is from having to peek into JSON values in
order to make the nesting work.
By this, I mean that if we have `{"a": [_, _, _]}`, we extract `a` values
into a `Vec<Value>`, i.e. `[_, _, _]`.
By extracting values, we are able to then use the reader to read `&[Value]`
without caring about its key (`a`).
The downside of this approach is that we have to clone values to get
`Vec<Value>`, as I couldn't find an alternative.
I could probably defer the extraction of `[_, _, _]` for later, but I was
concerned that it was just going to make things messy.
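A rough sketch of the extraction step described above, under stated assumptions: the `Value` enum below is a small stand-in for `serde_json::Value` so the example needs no external crates, and `extract_list_values` is a hypothetical name, not a function from this PR. It shows one possible reading of the approach, where the `a` arrays are pulled out of each row and cloned into a flat `Vec<Value>` that can then be read without reference to the key.

```rust
use std::collections::BTreeMap;

/// Stand-in for `serde_json::Value` (assumption: only the variants needed here).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Number(f64),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// Hypothetical helper: for every row, look up `key` and, if it holds an
/// array, clone its elements into one flat `Vec<Value>`. The per-row lengths
/// are returned alongside (`None` for a missing, null, or non-array field)
/// so that list offsets can be rebuilt afterwards.
fn extract_list_values(rows: &[Value], key: &str) -> (Vec<Value>, Vec<Option<usize>>) {
    let mut values = Vec::new();
    let mut lengths = Vec::with_capacity(rows.len());

    for row in rows {
        // Only objects can carry the key; anything else counts as missing.
        let field = match row {
            Value::Object(map) => map.get(key),
            _ => None,
        };
        match field {
            Some(Value::Array(items)) => {
                // This clone is the cost mentioned above: the inner values
                // are copied so they can be handed on as a `&[Value]`.
                values.extend(items.iter().cloned());
                lengths.push(Some(items.len()));
            }
            _ => lengths.push(None),
        }
    }
    (values, lengths)
}
```

The flattened vector can then be passed to the same reading logic as a `&[Value]`, which is what makes the nesting recursive, at the price of one clone per nested element.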
I got lost a lot in the slough of complexity in this code.