sgrebnov commented on PR #13267: URL: https://github.com/apache/datafusion/pull/13267#issuecomment-2472851057
@findepi - there are 3 main challenges unparse optimized plan w/o this improvement **1. Column references require additional processing to project correctly** ```rust let mut config = ConfigOptions::new(); config .optimizer .optimize_projections_preserve_existing_projections = false; let state = SessionConfig::from(config); let ctx = SessionContext::new_with_config(state); let optimizer = OptimizeProjections::new(); ctx.sql("CREATE TABLE test (a int, b int, c int)").await?; let sql = "SELECT a, sum(b) as sum_qty, count(*) as count_order FROM test where c > 5 group by a"; let df = ctx.sql(&sql).await?; let original_plan = df.logical_plan().clone(); println!("original_plan\n:{}", original_plan.display_indent()); println!("sql: {}", plan_to_sql(&original_plan)?.to_string()); let optimized_plan = optimizer.rewrite(original_plan, &ctx.state()).data()?; println!("optimized_plan\n: {}", optimized_plan.display_indent()); println!("sql: {}", plan_to_sql(&optimized_plan)?.to_string()); ``` Will produce the following ```console original_plan: Projection: test.a, sum(test.b) AS sum_qty, count(*) AS count_order Aggregate: groupBy=[[test.a]], aggr=[[sum(test.b), count(*)]] Filter: test.c > Int64(5) TableScan: test sql: SELECT test.a, sum(test.b) AS sum_qty, count(*) AS count_order FROM test WHERE (test.c > 5) GROUP BY test.a optimized_plan: Projection: test.a, sum(test.b) AS sum_qty, count(*) AS count_order Aggregate: groupBy=[[test.a]], aggr=[[sum(test.b), count(*)]] Projection: test.a, test.b Filter: test.c > Int64(5) TableScan: test projection=[a, b, c] sql: SELECT test.a, sum(test.b) AS sum_qty, count(*) AS count_order FROM (SELECT test.a, test.b FROM test WHERE (test.c > 5)) GROUP BY test.a ``` As new projection added (`Projection: test.a, test.b`) we now have additional subquery so we can't unparse the original projection columns as-is anymore: The following is invalid ``` SELECT test.a, sum(test.b) AS sum_qty, count(*) AS count_order FROM (SELECT test.a, test.b FROM test WHERE (test.c > 5)) GROUP BY test.a ``` Should be re-written as ``` SELECT a, sum(b) AS sum_qty, count(*) AS count_order FROM (SELECT test.a, test.b FROM test WHERE (test.c > 5)) GROUP BY a ``` This affects not just SELECT part but also other query components (grouping, ordering, join, window/aggregation funcitons/etc). This could be fixed by using additional post processing (collecting allowed table names and aliases and updating columns referring to tables/aliases that are not in scope anymore). There are some tricky cases (union, joins where all allowed references are not yet collected). **2. Unprojecting aggregation and window expressions ** Existing logic (few different cases) relies on Projections to identify and correctly unproject columns based on aggregations. With the optimization, Projection nodes could be removed and the overall identification approach needs improvements to work correct with removed Projections https://github.com/apache/datafusion/blob/main/datafusion/sql/src/unparser/utils.rs#L37 https://github.com/apache/datafusion/blob/main/datafusion/sql/src/unparser/plan.rs#L186 https://github.com/apache/datafusion/blob/main/datafusion/sql/src/unparser/plan.rs#L329 **3. Helps identify SQL body (SELECT) in general** Current unparsing logic is based on Projection to identify where to start a subquery (derive) - this requires improvements to work correct with removed Projections. @goldmedal - does the "Push-down doesn’t work" part mean that projections are not pushed correctly (does not actually work) or something else? Could you please add more clarity on this. Going to test with `sqllogictests` as well - thank you for pointing to this. The main question is: do you think this work is very specific, meaning the improvement should be added directly to the project, or do you believe other developers could benefit from it, so we should continue exploring the best way to integrate/add this to the DataFusion? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org