tustvold opened a new issue, #2581:
URL: https://github.com/apache/arrow-datafusion/issues/2581

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   Currently projection indices are pushed down to scans as `Vec<usize>`. This 
creates some ambiguities:
   
   * How to handle out of order or repeated indices - 
https://github.com/apache/arrow-datafusion/issues/2543
   * How to handle nested types - 
https://github.com/apache/arrow-datafusion/issues/2453
   
   To demonstrate how these problems intertwine, consider the case of
   
   ```
   Struct {
      first: Struct {
         a: Integer,
         b: Integer,
      },
      second: Struct {
         c: Integer
      }
   }
   ```
   
   If I project `["first.a", "second.c", "first.b"]` what is the resulting 
schema?
   
   **Describe the solution you'd like**
   
   I would like to propose we instead push down a leaf column mask, where leaf 
columns are fields with no children, as enumerated by a depth-first scan of the 
schema tree. This avoids any ordering ambiguities, whilst also being relatively 
straightforward to implement and interpret.
   
   I recently introduced a similar concept to the parquet reader 
https://github.com/apache/arrow-rs/pull/1716. We could theoretically lift this 
into arrow-rs, potentially adding support to RecordBatch for it, and then use 
this in DataFusion.
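
To make the idea concrete, here is a minimal sketch of a leaf column mask over the example schema above. The `Field` enum and the `example_schema`, `leaf_paths` and `project` helpers are hypothetical, simplified stand-ins written for illustration. They are not the arrow-rs or DataFusion API, and the real implementation may differ.

```rust
// Hypothetical, simplified schema tree; NOT the arrow-rs `Field`/`Schema` types.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    Leaf(String),
    Struct(String, Vec<Field>),
}

// The example schema from this issue, with an unnamed root struct.
fn example_schema() -> Field {
    Field::Struct(
        String::new(),
        vec![
            Field::Struct(
                "first".to_string(),
                vec![Field::Leaf("a".to_string()), Field::Leaf("b".to_string())],
            ),
            Field::Struct("second".to_string(), vec![Field::Leaf("c".to_string())]),
        ],
    )
}

fn join(prefix: &str, name: &str) -> String {
    if prefix.is_empty() {
        name.to_string()
    } else {
        format!("{}.{}", prefix, name)
    }
}

// Enumerate leaves depth-first; a leaf's position in `out` is its leaf index.
fn leaf_paths(field: &Field, prefix: &str, out: &mut Vec<String>) {
    match field {
        Field::Leaf(name) => out.push(join(prefix, name)),
        Field::Struct(name, children) => {
            let path = join(prefix, name);
            for child in children {
                leaf_paths(child, &path, out);
            }
        }
    }
}

// Prune the tree, keeping only leaves whose mask entry is true.
// `next` is the index of the next leaf in depth-first order.
// Structs left with no selected leaves are dropped entirely.
fn project(field: &Field, mask: &[bool], next: &mut usize) -> Option<Field> {
    match field {
        Field::Leaf(_) => {
            let keep = mask[*next];
            *next += 1;
            if keep { Some(field.clone()) } else { None }
        }
        Field::Struct(name, children) => {
            let kept: Vec<Field> = children
                .iter()
                .filter_map(|child| project(child, mask, next))
                .collect();
            if kept.is_empty() {
                None
            } else {
                Some(Field::Struct(name.clone(), kept))
            }
        }
    }
}
```

In this sketch the leaves enumerate as `first.a`, `first.b`, `second.c`. Any request for `first.a` and `second.c`, in whatever order, becomes the mask `[true, false, true]`, and the projected schema always keeps the original field order. A mask cannot express reordering or repetition, which is how it sidesteps the ambiguity.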
   
   **Describe alternatives you've considered**
   
   We could choose not to support nested pushdown.
   
   **Additional context**
   
   Currently pushdown for nested types in ParquetExec is broken - 
https://github.com/apache/arrow-datafusion/issues/2453
   
   Thoughts @andygrove @alamb 


