thinkharderdev commented on issue #7845:
URL:
https://github.com/apache/arrow-datafusion/issues/7845#issuecomment-1779821351
We (Coralogix) built our own binary JSONB-style format (we call it jsona,
for "JSON Arrow") that we plan to open-source in the next couple of months
(hopefully in the Jan/Feb time frame; we still need to fill in some
details). It has some nifty features for vectorized processing.
In broad strokes, we took the `Tape` representation and tweaked it a bit.
A `JsonaArray` is encoded as:
```
DataType::Struct(Fields::from(vec![
    Field::new(
        "nodes",
        DataType::List(Arc::new(Field::new("item", DataType::UInt32, true))),
        false,
    ),
    Field::new(
        "keys",
        DataType::List(Arc::new(Field::new(
            "item",
            DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
            true,
        ))),
        false,
    ),
    Field::new(
        "values",
        DataType::List(Arc::new(Field::new("item", DataType::Utf8, true))),
        false,
    ),
]))
```
So you have three child arrays:
- `nodes`: basically `TapeElement` encoded as a `u32`, where the top 4 bits
  encode the type (StartObject, EndObject, StartArray, EndArray, Key,
  String, Number) and the bottom 28 bits store an offset
- `keys`: the JSON keys, stored in a dictionary array
- `values`: the JSON leaf values
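To make the `nodes` layout concrete, here is a minimal sketch of packing and unpacking one node. Only the 4-bit type / 28-bit offset split comes from the description above; the `NodeType` enum, its numeric tag values, and the `pack_node`/`unpack_node` helpers are hypothetical names chosen for illustration, not the actual jsona API.

```rust
/// Hypothetical node tags. The variants follow the list above, but the
/// numeric values are an assumption made for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum NodeType {
    StartObject = 0,
    EndObject = 1,
    StartArray = 2,
    EndArray = 3,
    Key = 4,
    String = 5,
    Number = 6,
}

/// The bottom 28 bits of each node hold the offset.
const OFFSET_BITS: u32 = 28;
const OFFSET_MASK: u32 = (1 << OFFSET_BITS) - 1;

/// Pack a type tag (top 4 bits) and an offset (bottom 28 bits) into a u32.
/// Returns `None` if the offset does not fit in 28 bits.
fn pack_node(ty: NodeType, offset: u32) -> Option<u32> {
    if offset > OFFSET_MASK {
        return None;
    }
    Some(((ty as u32) << OFFSET_BITS) | offset)
}

/// Split a packed node back into its type tag and offset.
/// Returns `None` for tag values this sketch does not define.
fn unpack_node(node: u32) -> Option<(NodeType, u32)> {
    let ty = match node >> OFFSET_BITS {
        0 => NodeType::StartObject,
        1 => NodeType::EndObject,
        2 => NodeType::StartArray,
        3 => NodeType::EndArray,
        4 => NodeType::Key,
        5 => NodeType::String,
        6 => NodeType::Number,
        _ => return None,
    };
    Some((ty, node & OFFSET_MASK))
}
```

With 28 offset bits, a single row can address up to 2^28 entries, which is why `pack_node` rejects larger offsets rather than silently truncating them.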
If this is something others would be interested in, we would be happy to
upstream it into arrow-rs proper as a native array type (not in the sense of
adding it to the Arrow spec, but as something with "native" APIs in arrow-rs
to avoid the ceremony around dealing with struct arrays) and add support in
DataFusion.
The benefits of this over just using JSONB in a regular binary array (and
the reason we built it) are roughly:
1. You can test for existence/non-existence/nullity of a JSON path using
just the `nodes`/`keys` arrays, which are generally quite compact and
cache-friendly. This is especially helpful when you are doing predicate
pushdown into Parquet, since you can potentially prune significant IO by
not reading the `values` array.
2. Manipulating the "structure" (removing paths, inserting paths, etc.) is
quite efficient, as it mostly means manipulating the `nodes` array.
3. It's very efficient to serialize back to a JSON string.
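As a rough illustration of the first point, the sketch below checks whether the root object of a single row contains a given key while reading only the `nodes` and `keys` data. Everything about the offset semantics here is an assumption borrowed from tape-style encodings rather than something stated above: a StartObject/StartArray offset is taken to be the index of its matching end node, and a Key offset is taken to index into that row's keys. The tag values and the `node` and `has_top_level_key` helpers are likewise hypothetical.

```rust
const OFFSET_BITS: u32 = 28;
const OFFSET_MASK: u32 = (1 << OFFSET_BITS) - 1;

// Hypothetical tag values for the top 4 bits (assumed for this sketch).
const START_OBJECT: u32 = 0;
const END_OBJECT: u32 = 1;
const START_ARRAY: u32 = 2;
const KEY: u32 = 4;
const STRING: u32 = 5;

/// Build a packed node from a tag and an offset (helper for examples).
fn node(tag: u32, offset: u32) -> u32 {
    (tag << OFFSET_BITS) | (offset & OFFSET_MASK)
}

/// Returns true if the root object of one row has `key` as a direct member.
/// Only `nodes` and `keys` are consulted; the leaf values are never read.
fn has_top_level_key(nodes: &[u32], keys: &[&str], key: &str) -> bool {
    // The row must start with an object.
    let Some(&first) = nodes.first() else { return false };
    if first >> OFFSET_BITS != START_OBJECT {
        return false;
    }
    // Assumption: a StartObject offset is the index of its matching EndObject.
    let end = (first & OFFSET_MASK) as usize;
    let mut i = 1;
    while i < end && i < nodes.len() {
        let n = nodes[i];
        if n >> OFFSET_BITS != KEY {
            return false; // malformed row: expected a key here
        }
        // Assumption: a Key offset indexes into this row's keys.
        if keys.get((n & OFFSET_MASK) as usize) == Some(&key) {
            return true;
        }
        // Skip the value that follows the key.
        let Some(&v) = nodes.get(i + 1) else { return false };
        i = match v >> OFFSET_BITS {
            // Nested container: jump past its matching end node.
            START_OBJECT | START_ARRAY => (v & OFFSET_MASK) as usize + 1,
            // Leaf value: a single node.
            _ => i + 2,
        };
    }
    false
}
```

Under these assumptions, nested containers are skipped in one jump, so the cost of the check scales with the number of top-level members rather than with the size of the document.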