adriangb commented on issue #6736: URL: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781436167
Digging deep into some Spark code, I found some pretty enlightening information about how this will actually be encoded into Parquet: https://github.com/apache/spark/commit/3c3d1a6a9f6daf6db5148f1423f49f4bce142858#diff-ca9eeead72220965c7bbd52631f7125d4c1ef22b898e5baec83abc7be9495325

So it seems that https://github.com/apache/datafusion/issues/11745 will ultimately be a blocker for proper support. I think the things we'll need here are:

- The ability to project individual struct fields, in particular `column -> typed_value -> field_name`, for selection and during predicate-pushdown pruning
- Functions that operate on the entire structure and know how to parse the binary metadata/value fields
- A type that you can declare at the schema level that doesn't force you to exhaustively define the unknown typed fields of the struct
- Statistics support for nested struct fields

On the DataFusion side, I think all we need is something like https://github.com/apache/datafusion/pull/15057 to allow rewriting a filter or projection such as `variant_get(col, 'key') = 5` into `"col.typed_value.key.typed_value" = 5` on a per-file level if we see from the file schema that `col` is shredded. Then, if all of the above is in place, stats filtering, selective reading of the column for filtering / projection, etc. should follow.
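To make the per-file rewrite concrete, here is a minimal sketch of the idea in Rust. This is not DataFusion's actual `Expr` API; the `Expr` enum, `rewrite_variant_get`, and the `(column, key)` shredding set are all hypothetical stand-ins used only to illustrate the transformation from `variant_get(col, 'key')` to a direct reference to the shredded subcolumn:

```rust
use std::collections::HashSet;

// Hypothetical, simplified expression tree (not DataFusion's Expr).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    // Represents variant_get(<column>, '<key>')
    VariantGet { column: String, key: String },
    Eq(Box<Expr>, Box<Expr>),
}

/// Rewrite `variant_get(col, key)` into a reference to the shredded
/// subcolumn `col.typed_value.key.typed_value` when the file schema
/// says that (col, key) is shredded; otherwise leave it unchanged so
/// it falls back to parsing the binary metadata/value fields.
fn rewrite_variant_get(expr: Expr, shredded: &HashSet<(String, String)>) -> Expr {
    match expr {
        Expr::VariantGet { column, key } => {
            if shredded.contains(&(column.clone(), key.clone())) {
                Expr::Column(format!("{column}.typed_value.{key}.typed_value"))
            } else {
                Expr::VariantGet { column, key }
            }
        }
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(rewrite_variant_get(*l, shredded)),
            Box::new(rewrite_variant_get(*r, shredded)),
        ),
        other => other,
    }
}

fn main() {
    // Pretend this file's schema tells us `col` is shredded on `key`.
    let mut shredded = HashSet::new();
    shredded.insert(("col".to_string(), "key".to_string()));

    // variant_get(col, 'key') = 5
    let filter = Expr::Eq(
        Box::new(Expr::VariantGet { column: "col".into(), key: "key".into() }),
        Box::new(Expr::Literal(5)),
    );

    // Becomes "col.typed_value.key.typed_value" = 5 for this file.
    let rewritten = rewrite_variant_get(filter, &shredded);
    println!("{rewritten:?}");
}
```

The point of doing this per file is that shredding is a property of each Parquet file's schema: the same logical filter stays as a generic `variant_get` for unshredded files and becomes a plain column predicate (eligible for stats pruning and selective column reads) for shredded ones.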