[I] Schema incorrect after select over aggregate of expr [datafusion]

via GitHub Thu, 02 May 2024 04:29:16 -0700


timsaucer opened a new issue, #10346:
URL: https://github.com/apache/datafusion/issues/10346


   ### Describe the bug
   
   When you have a column that is created using `array_agg` and the column name 
is the same as the column name of the original data, calling `schema()` 
produces a result that is based on the original data rather than the aggregate 
data.
   
   To be more clear, if you call something like `.aggregate(vec![col("a"), 
col("b")], vec![array_agg(col("c")).alias("c")])` and you print the schema, 
followed by a `select_columns("c")` you will get a different schema displayed. 
This only happens if the column`c` above is a computed value, even if it is as 
simple as an expression `col("c").alias("c")`.
   
   The minimal example below shows this.
   
   
   ### To Reproduce
   
   
   The following code demonstrates that selecting the column changes the 
displayed schema.
   
   The input csv is a simple table:
   ```
   a,b,c
   1,2,3
   4,5,6
   7,8,9
   10,11,12
   ```
   
   ```
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
   
       let ctx = SessionContext::new();
       let mut df = 
ctx.read_csv("/Users/tsaucer/working/testing_ballista/lead_lag/example.csv", 
CsvReadOptions::default()).await?;
   
       println!("Original data:");
       df.clone().show().await?;
   
       df = df
           .clone()
           .with_column("c", col("c"))?
           .aggregate(vec![col("a"), col("b")], 
vec![array_agg(col("c")).alias("c")])?;
   
       println!("\nBefore select\n{}", df.schema().field(2));
       println!("{}", df.logical_plan().display_indent());
   
       df = df.select(vec![col("a"), col("b"), col("c")])?;
       println!("\nAfter select\n{}", df.schema().field(2));
       println!("{}", df.logical_plan().display_indent());
   
       println!("\nFinal data:");
       df.show().await?;
   
       Ok(())
   }
   ```
   
   Produces the following output:
   ```
   Original data:
   +----+----+----+
   | a  | b  | c  |
   +----+----+----+
   | 1  | 2  | 3  |
   | 4  | 5  | 6  |
   | 7  | 8  | 9  |
   | 10 | 11 | 12 |
   +----+----+----+
   
   Before select
   Field { name: "c", data_type: List(Field { name: "item", data_type: Int64, 
nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: 
true, dict_id: 0, dict_is_ordered: false, metadata: {} }
   Aggregate: groupBy=[[?table?.a, ?table?.b]], aggr=[[ARRAY_AGG(c) AS c]]
     Projection: ?table?.a, ?table?.b, ?table?.c AS c
       TableScan: ?table?
   
   After select
   Field { name: "c", data_type: Int64, nullable: true, dict_id: 0, 
dict_is_ordered: false, metadata: {} }
   Projection: ?table?.a, ?table?.b, c
     Aggregate: groupBy=[[?table?.a, ?table?.b]], aggr=[[ARRAY_AGG(c) AS c]]
       Projection: ?table?.a, ?table?.b, ?table?.c AS c
         TableScan: ?table?
   
   Final data:
   +----+----+------+
   | a  | b  | c    |
   +----+----+------+
   | 7  | 8  | [9]  |
   | 4  | 5  | [6]  |
   | 1  | 2  | [3]  |
   | 10 | 11 | [12] |
   +----+----+------+
   ```
   
   You can see the final table shown is correct, but the displayed schema is 
not.
   
   ### Expected behavior
   
   Schema should remain invariant under trivial `select` of `col` or 
`select_columns`. 
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Schema incorrect after select over aggregate of expr [datafusion]

Reply via email to