sgrebnov opened a new pull request, #13267: URL: https://github.com/apache/datafusion/pull/13267
## Which issue does this PR close? The `optimize_projections` optimization is very useful when used alongside unparsing logic, as it pushes down projections to the `TableScan` and ensures only required columns are fetched. The downside of this process is that the rule modifies the original plan in a way that makes it difficult to unparse, and the resultant plan is not always optimal or efficient for unparsing use cases, for example https://gist.github.com/sgrebnov/5071d2834e812b62bfdf434cf7e7e54c Original query (TPC-DS Q72) ```sql select i_item_desc ,w_warehouse_name ,d1.d_week_seq ,sum(case when p_promo_sk is null then 1 else 0 end) no_promo ,sum(case when p_promo_sk is not null then 1 else 0 end) promo ,count(*) total_cnt from catalog_sales join inventory on (cs_item_sk = inv_item_sk) join warehouse on (w_warehouse_sk=inv_warehouse_sk) join item on (i_item_sk = cs_item_sk) join customer_demographics on (cs_bill_cdemo_sk = cd_demo_sk) join household_demographics on (cs_bill_hdemo_sk = hd_demo_sk) join date_dim d1 on (cs_sold_date_sk = d1.d_date_sk) join date_dim d2 on (inv_date_sk = d2.d_date_sk) join date_dim d3 on (cs_ship_date_sk = d3.d_date_sk) left outer join promotion on (cs_promo_sk=p_promo_sk) left outer join catalog_returns on (cr_item_sk = cs_item_sk and cr_order_number = cs_order_number) where d1.d_week_seq = d2.d_week_seq and inv_quantity_on_hand < cs_quantity and d3.d_date > d1.d_date + INTERVAL '5 days' and hd_buy_potential = '501-1000' and d1.d_year = 1999 and cd_marital_status = 'S' group by i_item_desc,w_warehouse_name,d1.d_week_seq order by total_cnt desc, i_item_desc, w_warehouse_name, d_week_seq LIMIT 100; ``` Plan and query after applying `optimize_projections` rule. Notice the additional projections added after joins. <img width="739" alt="image" src="https://github.com/user-attachments/assets/c76d00e1-e3e4-423b-8dad-a48b58a9a63d"> ```sql select "i_item_desc", "w_warehouse_name", "d_week_seq", sum(case when "p_promo_sk" is null then 1 else 0 end) as "no_promo", sum(case when "p_promo_sk" is not null then 1 else 0 end) as "promo", count(1) as "total_cnt" from ( select "w_warehouse_name", "i_item_desc", "d_week_seq", "p_promo_sk" from ( select "cs_item_sk", "cs_order_number", "w_warehouse_name", "i_item_desc", "d_week_seq", "promotion"."p_promo_sk" from ( select "cs_item_sk", "cs_promo_sk", "cs_order_number", "w_warehouse_name", "i_item_desc", "d_week_seq" from ( select "cs_ship_date_sk", "cs_item_sk", ... ``` ## Rationale for this change To support unparsing plans after `optimize_projections` is applied, it is proposed to add the `optimize_projections_preserve_existing_projections` config option to prevent the optimization logic from creating or removing projections and to preserve the original structure. It ensures the query layout remains simple and readable, relying on the underlying SQL engine to apply its own optimizations during execution. ## Are these changes tested? Added test for `optimize_projections_preserve_existing_projections` config options. Unparsing have been tested by running all TPC-H and TPC-DS queries with `optimization_projections` enabled. ## Are there any user-facing changes? Yes, a new `optimize_projections_preserve_existing_projections` configuration option has been introduced, which can be specified via `SessionConfig` or at a lower level using `OptimizerContext::new_with_options`. ```rust SessionStateBuilder::new() .with_config( SessionConfig::new().with_optimize_projections_preserve_existing_projections(true), ) .build(); ``` Or ```rust let mut config = ConfigOptions::new(); config .optimizer .optimize_projections_preserve_existing_projections = preserve_projections; let optimizer_context = OptimizerContext::new_with_options(config); ``` There is no changes in default behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org