hozan23 opened a new issue, #19353:
URL: https://github.com/apache/datafusion/issues/19353
### Describe the bug
When a query contains a `GROUP BY` clause, DataFusion collapses complex
projection expressions into a single `Column` whose name is a stringified version
of the full expression. As a result, the original expression tree (`Expr`) is
lost from the projection and exists only within the expressions of the
aggregate plan node.
For the same projection expression:

- Without `GROUP BY`: the projection remains a fully composed `Expr` (e.g.
  `Cast`, `ScalarFunction`, etc.).
- With `GROUP BY`: the projection is replaced with `Alias(Column(<stringified
  expression>))`.

This prevents analyzer or optimizer rules from inspecting or rewriting the
original expression.
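
The collapse can be illustrated with a small std-only model. This is only a sketch: the `Expr` enum and the `rebase_onto_group_by` function below are hypothetical stand-ins written for this illustration, not DataFusion's real types or its actual rewrite code. It shows how replacing a sub-expression that matches a `GROUP BY` expression with a column named after its string form discards the tree structure.

```rust
use std::fmt;

// Toy stand-in for an expression tree (hypothetical, NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(String),
    Cast(Box<Expr>, String),
    ScalarFunction(String, Vec<Expr>),
    Alias(Box<Expr>, String),
}

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "{}", name),
            Expr::Literal(value) => write!(f, "'{}'", value),
            Expr::Cast(inner, ty) => write!(f, "CAST({} AS {})", inner, ty),
            Expr::ScalarFunction(name, args) => {
                let args: Vec<String> = args.iter().map(|a| a.to_string()).collect();
                write!(f, "{}({})", name, args.join(", "))
            }
            Expr::Alias(inner, name) => write!(f, "{} AS {}", inner, name),
        }
    }
}

// Hypothetical model of the post-aggregation rewrite: any sub-expression equal
// to a GROUP BY expression becomes a column named after its string form.
fn rebase_onto_group_by(expr: &Expr, group_by: &[Expr]) -> Expr {
    if group_by.contains(expr) {
        return Expr::Column(expr.to_string());
    }
    match expr {
        Expr::Cast(inner, ty) => {
            Expr::Cast(Box::new(rebase_onto_group_by(inner, group_by)), ty.clone())
        }
        Expr::ScalarFunction(name, args) => Expr::ScalarFunction(
            name.clone(),
            args.iter().map(|a| rebase_onto_group_by(a, group_by)).collect(),
        ),
        Expr::Alias(inner, name) => {
            Expr::Alias(Box::new(rebase_onto_group_by(inner, group_by)), name.clone())
        }
        // Leaves (columns, literals) are kept as-is.
        other => other.clone(),
    }
}
```

With an empty `group_by` slice the projection comes back unchanged. Once the aliased expression is also a `GROUP BY` expression, the result is `Alias(Column("CAST(...)"))`, which is the shape reported in this issue.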
### To Reproduce
```rust
use std::sync::Arc;

use datafusion::{
    arrow::{
        array::{Int32Array, RecordBatch, StringArray, TimestampSecondArray},
        datatypes::{DataType, Field, Schema, TimeUnit},
    },
    common::tree_node::{Transformed, TreeNode},
    datasource::MemTable,
    execution::{SessionStateBuilder, context::SessionContext},
    logical_expr::LogicalPlan,
};

// Build a session with a single in-memory `users` table.
fn create_test_session() -> SessionContext {
    let state = SessionStateBuilder::new().with_default_features().build();
    let ctx = SessionContext::new_with_state(state);

    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
        Field::new(
            "_modifiedDateTime",
            DataType::Timestamp(TimeUnit::Second, None),
            false,
        ),
    ]));

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(StringArray::from(vec!["Alice", "Bob", "Charlie"])),
            Arc::new(TimestampSecondArray::from(vec![
                chrono::Utc::now().timestamp(),
                chrono::Utc::now().timestamp(),
                chrono::Utc::now().timestamp(),
            ])),
        ],
    )
    .unwrap();

    let table = MemTable::try_new(schema, vec![vec![batch]]).unwrap();
    ctx.register_table("users", Arc::new(table)).unwrap();
    ctx
}

#[tokio::main]
async fn main() {
    let ctx = create_test_session();

    let sql = r#"
        SELECT
            CAST(
                DATE_FORMAT(
                    CAST(
                        TO_DATE(
                            DATE_FORMAT(
                                CAST(`_modifiedDateTime` AS TIMESTAMP),
                                '%Y-%m-%dT%H:%i:%S'
                            ),
                            '%Y-%m-%dT%H:%i:%S'
                        ) AS TIMESTAMP
                    ),
                    '%Y-%m-%d %H:%i:%s'
                ) AS TIMESTAMP
            ) AS qt_5anmnq5myd
        FROM users group by qt_5anmnq5myd;
    "#;

    let plan = ctx.state().create_logical_plan(sql).await.unwrap();

    // Walk the plan and print the expressions of every Projection node.
    plan.transform(|p| match p {
        LogicalPlan::Projection(ref projection) => {
            println!("DEBUG {:?}", projection.expr);
            Ok(Transformed::no(p))
        }
        _ => Ok(Transformed::no(p)),
    })
    .unwrap();
}
```
### Expected behavior
The projection expression should remain a structured `Expr` tree regardless
of whether a `GROUP BY` clause is present.
Specifically:

- The logical plan should preserve the original composed expression (`Cast`,
  `ScalarFunction`, etc.).
- Analyzer and optimizer rules should be able to traverse and rewrite the
  projection expression consistently for both grouped and non-grouped queries.

### Additional context
Tracing through `datafusion/sql/src/select.rs` shows that the expression is
rewritten during aggregation planning:
```
if !group_by_exprs.is_empty() || !aggr_exprs.is_empty() {
    self.aggregate(...)
}
```
- Before aggregation: `select_exprs` contains the fully composed `Expr`.
- After aggregation: `select_exprs_post_aggr` contains
  `Alias(Column(<stringified expression>))`, and the schema contains only this
  single stringified column.

This causes problems for downstream `AnalyzerRule` implementations. In our case,
we have an analyzer that rewrites schemas to match a remote database schema, but
for grouped queries we can no longer identify or rewrite the original projection
expression.
We implemented a workaround that attempts to reconstruct the projection
expression by matching it back to the `GROUP BY` expression and re-stringifying
it, but this is fragile and not ideal.
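
For illustration, the idea behind such a workaround can be sketched against a toy model. This is not our actual analyzer code: the `Expr` enum and `restore_from_group_by` below are hypothetical names written for this sketch and do not use DataFusion's API. The sketch looks up a collapsed column by comparing its name with the string form of each `GROUP BY` expression.

```rust
use std::fmt;

// Toy stand-in for an expression tree (hypothetical, NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Cast(Box<Expr>, String),
    Alias(Box<Expr>, String),
}

impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "{}", name),
            Expr::Cast(inner, ty) => write!(f, "CAST({} AS {})", inner, ty),
            Expr::Alias(inner, name) => write!(f, "{} AS {}", inner, name),
        }
    }
}

// Hypothetical workaround: if a column's name equals the string form of a
// GROUP BY expression, substitute that expression back in.
fn restore_from_group_by(expr: &Expr, group_by: &[Expr]) -> Expr {
    match expr {
        Expr::Column(name) => group_by
            .iter()
            .find(|g| g.to_string() == *name)
            .cloned()
            // No match: assume it is an ordinary column and keep it.
            .unwrap_or_else(|| expr.clone()),
        Expr::Alias(inner, alias) => {
            Expr::Alias(Box::new(restore_from_group_by(inner, group_by)), alias.clone())
        }
        other => other.clone(),
    }
}
```

The match depends entirely on both sides being stringified identically, and a genuine column whose name happens to equal such a string would be substituted by mistake. That dependence is what makes the approach fragile.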
A possible improvement would be to delay collapsing projection expressions
into stringified `Column`s until later (e.g. after analyzer rules or during
optimization planning), so that expression structure is preserved during
logical analysis.