adriangb commented on issue #6736:
URL: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781436167

   Digging deep into the Spark code, I found some pretty enlightening 
information about how this will actually be encoded in Parquet: 
https://github.com/apache/spark/commit/3c3d1a6a9f6daf6db5148f1423f49f4bce142858#diff-ca9eeead72220965c7bbd52631f7125d4c1ef22b898e5baec83abc7be9495325
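   For reference, a shredded Variant column ends up as roughly this nested group layout (a sketch based on the Variant shredding approach; the names `col` and `key` are illustrative, and `int64` just stands in for whatever type the key was shredded to):

   ```python
   # Sketch of the Parquet group layout for a shredded Variant column.
   # `metadata` and `value` are the binary Variant fields; `typed_value`
   # holds the shredded (strongly typed) subset of the data.
   shredded_variant = {
       "col": {
           "metadata": "binary",           # Variant metadata (key dictionary)
           "value": "binary",              # unshredded residual, Variant-encoded
           "typed_value": {                # shredded fields, one group per key
               "key": {
                   "value": "binary",      # residual when the type didn't match
                   "typed_value": "int64", # the shredded, strongly typed column
               },
           },
       }
   }

   def leaf_paths(node, prefix=()):
       """Yield dotted paths to every leaf (primitive) column."""
       for name, child in node.items():
           if isinstance(child, dict):
               yield from leaf_paths(child, prefix + (name,))
           else:
               yield ".".join(prefix + (name,))
   ```

   Walking that layout with `leaf_paths` is what produces leaf column names like `col.typed_value.key.typed_value`, which is why field-level projection and rewriting matter so much below.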
   
   So it seems that https://github.com/apache/datafusion/issues/11745 will 
ultimately be a blocker for proper support.
   
   I think the things we'll need here are:
   - Ability to project individual struct fields, in particular `column -> 
typed_value -> field_name`, for selection and during predicate pushdown pruning
   - Functions that operate on the entire structure and know how to parse the 
binary metadata/value fields
   - A type that you can declare at the schema level that doesn't force you to 
exhaustively define the unknown typed fields of the struct
   - Statistics support for nested struct fields
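   To make the first bullet concrete, a reader handling `variant_get(col, 'key')` would need to select a small set of leaf columns rather than the whole struct. A hypothetical helper (the real arrow-rs API would presumably express this as a `ProjectionMask` over the Parquet schema, not strings):

   ```python
   def columns_for_variant_path(col, key):
       """Given variant_get(col, key), return the leaf columns a reader
       would need: the shredded leaf, plus the fallbacks required to
       reconstruct values for rows where the key was not shredded."""
       return [
           f"{col}.typed_value.{key}.typed_value",  # shredded, strongly typed data
           f"{col}.typed_value.{key}.value",        # residual for non-conforming rows
           f"{col}.metadata",                       # needed to decode any residual
       ]
   ```

   The point of returning the `value`/`metadata` fallbacks alongside the typed leaf is that shredding is best-effort per row, so a correct reader can't drop them.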
   
   On the DataFusion side, I think all we need is something like 
https://github.com/apache/datafusion/pull/15057 to allow rewriting a filter or 
projection such as `variant_get(col, 'key') = 5` into 
`"col.typed_value.key.typed_value" = 5` on a per-file basis, when the file 
schema shows that `col` is shredded. Then, once all of the above is in place, 
stats filtering, selective reading of the column for filtering / projection, 
etc. should follow.
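   That per-file rewrite could look something like this (purely illustrative; a real implementation would operate on DataFusion `Expr` trees, not strings):

   ```python
   def rewrite_variant_get(col, key, file_schema, literal):
       """Rewrite `variant_get(col, key) = literal` into a comparison
       against the shredded leaf when this file's schema shows the key
       was shredded; otherwise keep the generic variant_get form."""
       shredded_leaf = f"{col}.typed_value.{key}.typed_value"
       if shredded_leaf in file_schema:
           # Shredded in this file: compare against the typed leaf directly,
           # which unlocks stats pruning and selective column reads.
           return f'"{shredded_leaf}" = {literal}'
       # Not shredded in this file: evaluate over the binary value/metadata.
       return f"variant_get({col}, '{key}') = {literal}"
   ```

   Because the rewrite is keyed on each file's own schema, a table where only some files shred `key` still evaluates correctly: shredded files get the fast typed path, the rest fall back to decoding the binary Variant.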

