xudong963 edited a comment on issue #1064:
URL: 
https://github.com/apache/arrow-datafusion/issues/1064#issuecomment-937813055


   The bug is located at 
https://github.com/apache/arrow-datafusion/blob/4687899957463ce81c4795a6d35d31320db0252b/datafusion/src/physical_plan/planner.rs#L836
   
   `input_dfschema` comes from the logical input schema, so the column index 
(`idx`) is computed against the logical input schema.
   
   That `idx` is wrapped in a physical expression and is later used in 
https://github.com/apache/arrow-datafusion/blob/4687899957463ce81c4795a6d35d31320db0252b/datafusion/src/physical_plan/type_coercion.rs#L56
   
   Note that the `schema` there comes from the physical input schema. So 
when the logical input schema and the physical input schema have different 
sizes, the logical `idx` no longer points at the right field and the bug appears.
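
To make the mismatch concrete, here is a minimal sketch in plain Rust. It does not use the DataFusion or Arrow types: a schema is modelled as an ordered slice of field names, and the helper names `index_of` and `resolve_via_logical` as well as the field names in the usage note are made up for illustration.

```rust
/// Hypothetical stand-in for a schema lookup: a schema is just an ordered
/// list of field names, and we return the position of `name` in it.
fn index_of(schema: &[&str], name: &str) -> Option<usize> {
    schema.iter().position(|field| *field == name)
}

/// Mimics the problematic pattern: the index is computed against the
/// logical schema but then used to read from the physical schema.
fn resolve_via_logical<'a>(
    logical: &[&str],
    physical: &[&'a str],
    name: &str,
) -> Option<&'a str> {
    let idx = index_of(logical, name)?; // index is valid for `logical` only
    physical.get(idx).copied() // may be out of range or the wrong field
}
```

With a logical schema `["a", "b", "d"]` and a physical schema `["b", "d"]`, resolving `"b"` this way yields the field `"d"`, and resolving `"d"` finds nothing at all, because index 2 is out of range for the physical schema.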
   
   The most direct fix I can think of is to get the idx of the column from the 
physical input schema instead: `let idx = input_schema.index_of(c.name.as_str())?;`.  
But the column name, the logical input schema field name, and the physical input 
schema field name are not always the same, as in the following case:
   ```sql
   select
       sum(l_extendedprice * l_discount) as revenue
   from
       lineitem
   where
           l_shipdate >= date '1994-01-01'
     and l_shipdate < date '1995-01-01'
     and l_discount between 0.06 - 0.01 and 0.06 + 0.01
     and l_quantity < 24;
   ```
   ```rust
   [datafusion/src/physical_plan/planner.rs:836] c = Column {
       relation: None,
       name: "SUM(lineitem.l_extendedprice * lineitem.l_discount)",
   }
   [datafusion/src/physical_plan/planner.rs:837] input_dfschema = DFSchema {
       fields: [
           DFField {
               qualifier: None,
               field: Field {
                   name: "SUM(lineitem.l_extendedprice * lineitem.l_discount)",
                   data_type: Float64,
                   nullable: true,
                   dict_id: 0,
                   dict_is_ordered: false,
                   metadata: None,
               },
           },
       ],
   }
   [datafusion/src/physical_plan/planner.rs:838] input_schema = Schema {
       fields: [
           Field {
               name: "SUM(lineitem.l_extendedprice Multiply lineitem.l_discount)",
               data_type: Float64,
               nullable: true,
               dict_id: 0,
               dict_is_ordered: false,
               metadata: None,
           },
       ],
       metadata: {},
   }
   ```
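The failure of a name-based lookup in this case can be sketched as follows. This is plain Rust rather than the real Arrow `Schema` type: `index_of_name` is a hypothetical helper over a slice of field names, and its error message is invented for illustration.

```rust
/// Hypothetical name-based lookup over an ordered list of field names.
/// Returns the field's position, or an error message if no field matches.
fn index_of_name(schema: &[&str], name: &str) -> Result<usize, String> {
    schema
        .iter()
        .position(|field| *field == name)
        .ok_or_else(|| format!("no field named {:?} in schema", name))
}
```

Given a physical schema whose only field is `SUM(lineitem.l_extendedprice Multiply lineitem.l_discount)`, looking up the column name `SUM(lineitem.l_extendedprice * lineitem.l_discount)` returns an error, because the two strings differ in how the operator is rendered (`*` versus `Multiply`).
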
   The second approach is to wrap the union logical plan in a projection plan, but 
that projection may itself be rewritten by the optimizer. In the case mentioned by 
@Dandandan, the projection wrapped around the union logical plan is optimized so 
that it only contains `d`. So in the end the bug is still there...
     
   Any suggestions on how to handle this would be appreciated, thanks! @alamb 
@Dandandan @houqp 

