tustvold opened a new issue, #2581: URL: https://github.com/apache/arrow-datafusion/issues/2581
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Currently projection indices are pushed down to scans as `Vec<usize>`. This creates some ambiguities:

* How to handle out-of-order or repeated indices - https://github.com/apache/arrow-datafusion/issues/2543
* How to handle nested types - https://github.com/apache/arrow-datafusion/issues/2453

To demonstrate how these problems intertwine, consider the case of

```
Struct {
  first: Struct {
    a: Integer,
    b: Integer,
  },
  second: Struct {
    c: Integer
  }
}
```

If I project `["first.a", "second.c", "first.b"]`, what is the resulting schema?

**Describe the solution you'd like**

I would like to propose that we instead push down a leaf column mask, where leaf columns are fields with no children, enumerated by a depth-first scan of the schema tree. A mask only records which leaves are selected, so it avoids any ordering ambiguity, whilst also being relatively straightforward to implement and interpret.

I recently introduced a similar concept to the parquet reader in https://github.com/apache/arrow-rs/pull/1716. We could theoretically lift this into arrow-rs, potentially adding support for it to RecordBatch, and then use it in DataFusion.

**Describe alternatives you've considered**

We could choose not to support nested pushdown.

**Additional context**

Currently pushdown for nested types in ParquetExec is broken - https://github.com/apache/arrow-datafusion/issues/2453

Thoughts @andygrove @alamb

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
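To make the leaf column mask proposed in the issue concrete, here is a minimal, standalone sketch of the idea. It does not use the arrow-rs or DataFusion APIs: the `Field` enum and the helpers `leaf_paths`, `mask_for` and `project` are hypothetical names introduced only for illustration. Under those assumptions, it enumerates leaves depth-first, turns a list of requested paths into a mask, and prunes the schema with that mask.

```rust
/// Hypothetical, simplified schema node (NOT the arrow-rs `Field` type).
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Leaf(String),
    Struct(String, Vec<Field>),
}

fn join(prefix: &str, name: &str) -> String {
    if prefix.is_empty() {
        name.to_string()
    } else {
        format!("{}.{}", prefix, name)
    }
}

/// Depth-first visit that records the dotted path of every leaf.
fn visit(field: &Field, prefix: &str, out: &mut Vec<String>) {
    match field {
        Field::Leaf(name) => out.push(join(prefix, name)),
        Field::Struct(name, children) => {
            let path = join(prefix, name);
            for child in children {
                visit(child, &path, out);
            }
        }
    }
}

/// Leaf columns of a schema, in depth-first order.
fn leaf_paths(schema: &[Field]) -> Vec<String> {
    let mut out = Vec::new();
    for field in schema {
        visit(field, "", &mut out);
    }
    out
}

/// One bool per leaf; the order of `wanted` does not affect the result.
fn mask_for(schema: &[Field], wanted: &[&str]) -> Vec<bool> {
    leaf_paths(schema)
        .iter()
        .map(|path| wanted.contains(&path.as_str()))
        .collect()
}

/// Keep masked leaves; drop structs that end up with no children.
/// `next` is the running depth-first leaf index.
fn prune(field: &Field, mask: &[bool], next: &mut usize) -> Option<Field> {
    match field {
        Field::Leaf(name) => {
            let keep = mask.get(*next).copied().unwrap_or(false);
            *next += 1;
            if keep { Some(Field::Leaf(name.clone())) } else { None }
        }
        Field::Struct(name, children) => {
            let mut kept = Vec::new();
            for child in children {
                if let Some(f) = prune(child, mask, next) {
                    kept.push(f);
                }
            }
            if kept.is_empty() { None } else { Some(Field::Struct(name.clone(), kept)) }
        }
    }
}

/// Apply a leaf mask to a schema, preserving the original field order.
fn project(schema: &[Field], mask: &[bool]) -> Vec<Field> {
    let mut next = 0;
    let mut kept = Vec::new();
    for field in schema {
        if let Some(f) = prune(field, mask, &mut next) {
            kept.push(f);
        }
    }
    kept
}

/// The nested schema from the issue text.
fn example_schema() -> Vec<Field> {
    vec![
        Field::Struct(
            "first".into(),
            vec![Field::Leaf("a".into()), Field::Leaf("b".into())],
        ),
        Field::Struct("second".into(), vec![Field::Leaf("c".into())]),
    ]
}

fn main() {
    let schema = example_schema();
    // The out-of-order request from the issue selects every leaf.
    let mask = mask_for(&schema, &["first.a", "second.c", "first.b"]);
    println!("{:?}", mask);
    println!("{:?}", leaf_paths(&project(&schema, &mask)));
}
```

In this sketch the projected schema always follows the original depth-first order, so `["first.a", "second.c", "first.b"]` and any permutation of it yield the same result. Any reordering would have to be a separate step applied after the scan.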
