adriangb opened a new issue, #19387:
URL: https://github.com/apache/datafusion/issues/19387

   The projection pushdown optimizer rule and its per-operator implementations
generally push a projection down only if it "narrows" the schema (i.e. has fewer
output expressions than input expressions) and all of its output expressions are
columns or literals:
   
   
https://github.com/apache/datafusion/blob/d68b629dc610972295d8f310b09cd854cf250dd3/datafusion/physical-plan/src/filter.rs#L470-L471
   
   
https://github.com/apache/datafusion/blob/d68b629dc610972295d8f310b09cd854cf250dd3/datafusion/physical-plan/src/repartition/mod.rs#L1045-L1055
   
   
https://github.com/apache/datafusion/blob/d68b629dc610972295d8f310b09cd854cf250dd3/datafusion/physical-plan/src/projection.rs#L255-L268
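
As a rough illustration of the condition described above, here is a minimal, self-contained sketch. It does not use DataFusion's actual types: `Expr`, `is_column_or_literal`, `can_push_down`, and `example_projection` are hypothetical names invented for this example, and the real per-operator checks linked above may differ in detail.

```rust
/// Hypothetical, simplified stand-in for a physical expression.
/// This is NOT DataFusion's `PhysicalExpr`; it only models the cases discussed here.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Reference to an input column by index.
    Column(usize),
    /// A constant value.
    Literal(i64),
    /// Struct field access, e.g. `large_struct['small_int_field']`.
    GetField(Box<Expr>, String),
    /// Binary multiplication of two sub-expressions.
    Multiply(Box<Expr>, Box<Expr>),
}

/// True only for bare column references and literals.
fn is_column_or_literal(expr: &Expr) -> bool {
    matches!(expr, Expr::Column(_) | Expr::Literal(_))
}

/// Sketch of the pushdown condition from the text: the projection must
/// "narrow" the schema (fewer outputs than inputs) and every output
/// expression must be a column or a literal.
fn can_push_down(projection: &[Expr], input_field_count: usize) -> bool {
    let narrows = projection.len() < input_field_count;
    narrows && projection.iter().all(is_column_or_literal)
}

/// Models `get_field(large_struct@0, small_int_field) * 2`,
/// the shape of the top-level projection in the example query of this issue.
fn example_projection() -> Vec<Expr> {
    vec![Expr::Multiply(
        Box::new(Expr::GetField(
            Box::new(Expr::Column(0)),
            "small_int_field".to_string(),
        )),
        Box::new(Expr::Literal(2)),
    )]
}
```

Under this model, `can_push_down(&[Expr::Column(1)], 2)` holds, while `can_push_down(&example_projection(), 1)` does not: the single output expression is neither a column nor a literal, and one output over one input does not narrow the schema.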
   
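   This is problematic for a query like the following, where the projection is
an expression over a struct field rather than a plain column or literal: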
   
   ```
   copy (
     select 1 as id, named_struct('large_string_field', 'big text!', 'small_int_field', 2) as large_struct
   )
   TO 'struct.parquet';
   
   create external table t stored as parquet location 'struct.parquet';
   
   explain format indent
   select large_struct['small_int_field'] * 2 from t where id = 1; 
   ```
   
   ```
   +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                                                                   |
   +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: get_field(t.large_struct, Utf8("small_int_field")) * Int64(2)                                                                                              |
   |               |   Filter: t.id = Int64(1)                                                                                                                                              |
   |               |     TableScan: t projection=[id, large_struct], partial_filters=[t.id = Int64(1)]                                                                                      |
   | physical_plan | ProjectionExec: expr=[get_field(large_struct@0, small_int_field) * 2 as t.large_struct[small_int_field] * Int64(2)]                                                    |
   |               |   CoalesceBatchesExec: target_batch_size=8192                                                                                                                          |
   |               |     FilterExec: id@0 = 1, projection=[large_struct@1]                                                                                                                  |
   |               |       RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=1                                                                                            |
   |               |         DataSourceExec: file_groups={1 group: [[Users/adrian/GitHub/datafusion/struct.parquet]]}, projection=[id, large_struct], file_type=parquet, predicate=id@0 = 1 |
   |               |                                                                                                                                                                        |
   +---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   ```

