thinkharderdev commented on issue #7845:
URL: 
https://github.com/apache/arrow-datafusion/issues/7845#issuecomment-1779821351

   We (Coralogix) built our own binary JSONB-style format (we call it jsona, 
for "JSON Arrow") that has some nifty features for vectorized processing. We 
are planning to open-source it in the next couple of months (hopefully in the 
Jan/Feb time frame; we still need to fill in some details). 
   
   In broad strokes, we took the `Tape` representation and tweaked it a bit. 
A `JsonaArray` is encoded as:
   
   ```
   DataType::Struct(Fields::from(vec![
       Field::new(
           "nodes",
           DataType::List(Arc::new(Field::new("item", DataType::UInt32, true))),
           false,
       ),
       Field::new(
           "keys",
           DataType::List(Arc::new(Field::new(
               "item",
               DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
               true,
           ))),
           false,
       ),
       Field::new(
           "values",
           DataType::List(Arc::new(Field::new("item", DataType::Utf8, true))),
           false,
       ),
   ]))
   ```
   
   So you have three child arrays:
   - `nodes` - Basically `TapeElement` encoded as a `u32`, where the top 4 
bits encode the type (StartObject, EndObject, StartArray, EndArray, Key, 
String, Number) and the bottom 28 bits store an offset
   - `keys` - The JSON keys, stored in a dict array
   - `values` - The JSON leaf values
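
To make the `nodes` layout concrete, here is a minimal sketch of packing a type tag into the top 4 bits and an offset into the bottom 28 bits of a `u32`. The names `NodeKind`, `pack` and `unpack`, and the numeric tag values, are assumptions for illustration only; the comment above does not say which tag value jsona assigns to each type.

```rust
/// Node kinds from the list above. The numeric tag values are an
/// assumption made for this sketch, not the actual jsona encoding.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum NodeKind {
    StartObject = 0,
    EndObject = 1,
    StartArray = 2,
    EndArray = 3,
    Key = 4,
    String = 5,
    Number = 6,
}

/// The tag lives in the top 4 bits, so the offset has 28 bits.
const KIND_SHIFT: u32 = 28;
const OFFSET_MASK: u32 = (1 << KIND_SHIFT) - 1;

/// Pack a kind and an offset into one `u32`.
/// Returns `None` if the offset does not fit in 28 bits.
fn pack(kind: NodeKind, offset: u32) -> Option<u32> {
    if offset > OFFSET_MASK {
        return None;
    }
    Some(((kind as u32) << KIND_SHIFT) | offset)
}

/// Split a packed node back into its kind and offset.
/// Returns `None` for tag values this sketch does not define.
fn unpack(node: u32) -> Option<(NodeKind, u32)> {
    let kind = match node >> KIND_SHIFT {
        0 => NodeKind::StartObject,
        1 => NodeKind::EndObject,
        2 => NodeKind::StartArray,
        3 => NodeKind::EndArray,
        4 => NodeKind::Key,
        5 => NodeKind::String,
        6 => NodeKind::Number,
        _ => return None,
    };
    Some((kind, node & OFFSET_MASK))
}
```

With 28 offset bits, a single node can address at most 2^28 (about 268 million) positions, which is the limit `pack` checks for.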
   
   If this is something others would be interested in, we would be happy to 
upstream it into arrow-rs proper as a native array type (not in the sense of 
adding it to the Arrow spec, but as something with "native" APIs in arrow-rs 
to avoid the ceremony around dealing with struct arrays) and add support in 
DataFusion.
   
   The benefits of this over just using JSONB in a regular binary array (and 
the reason we built it) are roughly:
   1. You can test for existence/non-existence/nullity of a JSON path using 
just the nodes/keys arrays, which are generally quite compact and cache 
friendly. This is especially helpful when you are doing predicate pushdown 
into Parquet, since you can potentially prune significant I/O by not reading 
the values array
   2. Manipulating the "structure" (removing paths, inserting paths, etc.) is 
quite efficient, as it mostly means manipulating the `nodes` array.
   3. It's very efficient to serialize back to a JSON string
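
As a rough illustration of the first benefit, the sketch below checks whether a path of object keys exists in one row while reading only `nodes` and `keys`. It rests on several assumptions that the comment above does not confirm: the tag values, that a Start node's offset is the index of its matching End node, and that a Key node's offset indexes into that row's `keys`. The keys are a plain `&[&str]` here rather than an Arrow dictionary array, and all function names are hypothetical.

```rust
// Tag values and offset semantics are assumptions for this sketch.
const START_OBJECT: u32 = 0;
const END_OBJECT: u32 = 1;
const START_ARRAY: u32 = 2;
const KEY: u32 = 4;
const STRING: u32 = 5;
const NUMBER: u32 = 6;

fn make_node(tag: u32, offset: u32) -> u32 {
    (tag << 28) | (offset & 0x0FFF_FFFF)
}

fn tag_of(node: u32) -> u32 {
    node >> 28
}

fn offset_of(node: u32) -> usize {
    (node & 0x0FFF_FFFF) as usize
}

/// Index of the node that follows the value starting at `pos`.
/// Containers are skipped in one step via their end index.
fn skip_value(nodes: &[u32], pos: usize) -> Option<usize> {
    let node = *nodes.get(pos)?;
    let next = match tag_of(node) {
        START_OBJECT | START_ARRAY => offset_of(node) + 1,
        _ => pos + 1,
    };
    // Guard against malformed input that would not move forward.
    if next > pos { Some(next) } else { None }
}

/// Returns true if every key in `path` can be followed through nested
/// objects. Only `nodes` and `keys` are read; leaf values are never touched.
fn has_path(nodes: &[u32], keys: &[&str], path: &[&str]) -> bool {
    let mut pos = 0usize;
    for segment in path {
        // The current value must be an object to descend into it.
        let node = match nodes.get(pos) {
            Some(&n) if tag_of(n) == START_OBJECT => n,
            _ => return false,
        };
        let end = offset_of(node).min(nodes.len());
        let mut i = pos + 1;
        let mut found = None;
        while i < end {
            let key_node = nodes[i];
            if tag_of(key_node) != KEY {
                return false; // malformed: expected a key here
            }
            if keys.get(offset_of(key_node)) == Some(segment) {
                found = Some(i + 1); // the value follows its key
                break;
            }
            i = match skip_value(nodes, i + 1) {
                Some(next) => next,
                None => return false,
            };
        }
        match found {
            Some(value_pos) => pos = value_pos,
            None => return false,
        }
    }
    pos < nodes.len()
}

/// Hand-built nodes and keys for the document {"a": {"b": 1}, "c": "x"}.
fn sample_row() -> (Vec<u32>, Vec<&'static str>) {
    let nodes = vec![
        make_node(START_OBJECT, 8), // 0: root object, ends at node 8
        make_node(KEY, 0),          // 1: "a"
        make_node(START_OBJECT, 5), // 2: nested object, ends at node 5
        make_node(KEY, 1),          // 3: "b"
        make_node(NUMBER, 0),       // 4: values[0]
        make_node(END_OBJECT, 2),   // 5
        make_node(KEY, 2),          // 6: "c"
        make_node(STRING, 1),       // 7: values[1]
        make_node(END_OBJECT, 0),   // 8
    ];
    (nodes, vec!["a", "b", "c"])
}
```

For the row built by `sample_row`, `has_path(&nodes, &keys, &["a", "b"])` holds while `has_path(&nodes, &keys, &["a", "z"])` does not, and neither call needs the `values` child.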


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
