sgrebnov commented on PR #13267:
URL: https://github.com/apache/datafusion/pull/13267#issuecomment-2472851057

   @findepi  - there are 3 main challenges unparse optimized plan w/o this 
improvement
   
   **1. Column references require additional processing to project correctly**
   
   ```rust
   let mut config = ConfigOptions::new();
   config
       .optimizer
       .optimize_projections_preserve_existing_projections = false;
   let state = SessionConfig::from(config);
   let ctx = SessionContext::new_with_config(state);
   
   let optimizer = OptimizeProjections::new();
   
   ctx.sql("CREATE TABLE test (a int, b int, c int)").await?;
   let sql = "SELECT a, sum(b) as sum_qty, count(*) as count_order FROM test 
where c > 5 group by a";
   
   let df = ctx.sql(&sql).await?;
   
   let original_plan = df.logical_plan().clone();
   println!("original_plan\n:{}", original_plan.display_indent());
   println!("sql: {}", plan_to_sql(&original_plan)?.to_string());
   
   let optimized_plan = optimizer.rewrite(original_plan, &ctx.state()).data()?;
   println!("optimized_plan\n: {}", optimized_plan.display_indent());
   println!("sql: {}", plan_to_sql(&optimized_plan)?.to_string());
   ```
   
   Will produce the following
   ```console
   original_plan:
   
   Projection: test.a, sum(test.b) AS sum_qty, count(*) AS count_order
     Aggregate: groupBy=[[test.a]], aggr=[[sum(test.b), count(*)]]
       Filter: test.c > Int64(5)
         TableScan: test
   
   sql: SELECT test.a, sum(test.b) AS sum_qty, count(*) AS count_order FROM 
test WHERE (test.c > 5) GROUP BY test.a
   
   optimized_plan:
   
   Projection: test.a, sum(test.b) AS sum_qty, count(*) AS count_order
     Aggregate: groupBy=[[test.a]], aggr=[[sum(test.b), count(*)]]
       Projection: test.a, test.b
         Filter: test.c > Int64(5)
           TableScan: test projection=[a, b, c]
   
   sql: SELECT test.a, sum(test.b) AS sum_qty, count(*) AS count_order FROM 
(SELECT test.a, test.b FROM test WHERE (test.c > 5)) GROUP BY test.a
   ```
   
   As new projection added (`Projection: test.a, test.b`) we now have 
additional subquery so we can't unparse the original projection columns as-is 
anymore:
   
   The following is invalid
   
   ```
   SELECT test.a, sum(test.b) AS sum_qty, count(*) AS count_order FROM (SELECT 
test.a, test.b FROM test WHERE (test.c > 5)) GROUP BY test.a
   ```
   
   Should be re-written as
   
   ```
   SELECT a, sum(b) AS sum_qty, count(*) AS count_order FROM (SELECT test.a, 
test.b FROM test WHERE (test.c > 5)) GROUP BY a
   ```
   
   This affects not just SELECT part but also other query components (grouping, 
ordering, join, window/aggregation funcitons/etc). This could be fixed by using 
additional post processing (collecting allowed table names and aliases and 
updating columns referring to tables/aliases that are not in scope anymore). 
There are some tricky cases (union, joins where all allowed references are not 
yet collected).
   
   **2. Unprojecting aggregation and window expressions **
   
   Existing logic (few different cases) relies on Projections to identify and 
correctly unproject columns based on aggregations. With the optimization, 
Projection nodes could be removed and the overall identification approach needs 
improvements to work correct with removed Projections
   
https://github.com/apache/datafusion/blob/main/datafusion/sql/src/unparser/utils.rs#L37
   
https://github.com/apache/datafusion/blob/main/datafusion/sql/src/unparser/plan.rs#L186
   
https://github.com/apache/datafusion/blob/main/datafusion/sql/src/unparser/plan.rs#L329
 
   
   **3. Helps identify SQL body (SELECT) in general**
   
   Current unparsing logic is based on Projection to identify where to start a 
subquery (derive) - this requires improvements to work correct with removed 
Projections.
   
   
   @goldmedal - does the "Push-down doesn’t work" part mean that projections 
are not pushed correctly (does not actually work) or something else? Could you 
please add more clarity on this. Going to test with `sqllogictests` as well - 
thank you for pointing to this.
   
   The main question is: do you think this work is very specific, meaning the 
improvement should be added directly to the project, or do you believe other 
developers could benefit from it, so we should continue exploring the best way 
to integrate/add this to the DataFusion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to