goldmedal commented on PR #13267: URL: https://github.com/apache/datafusion/pull/13267#issuecomment-2473934915
> @goldmedal - does the "Push-down doesn’t work" part mean that projections are not pushed correctly (does not actually work) or something else? Could you please add more clarity on this. Going to test with `sqllogictests` as well - thank you for pointing to this. > For example, the test in `joins.slt`. https://github.com/apache/datafusion/blob/4e1f8391d4c2736ed92e5acba16dab51fb053aed/datafusion/sqllogictest/test_files/joins.slt#L1418-L1422 After enabling `optimize_projections_preserve_existing_projections`, the projection for `HashJoinExec` will be removed. ```diff @@ -1425,10 +1433,9 @@ logical_plan 01)Projection: count(alias1) AS count(DISTINCT join_t1.t1_id) 02)--Aggregate: groupBy=[[]], aggr=[[count(alias1)]] 03)----Aggregate: groupBy=[[join_t1.t1_id AS alias1]], aggr=[[]] -04)------Projection: join_t1.t1_id -05)--------Inner Join: join_t1.t1_id = join_t2.t2_id -06)----------TableScan: join_t1 projection=[t1_id] -07)----------TableScan: join_t2 projection=[t2_id] +04)------Inner Join: join_t1.t1_id = join_t2.t2_id +05)--------TableScan: join_t1 projection=[t1_id] +06)--------TableScan: join_t2 projection=[t2_id] physical_plan 01)ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT join_t1.t1_id)] 02)--AggregateExec: mode=Final, gby=[], aggr=[count(alias1)] @@ -1436,7 +1443,7 @@ physical_plan 04)------AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)] 05)--------AggregateExec: mode=SinglePartitioned, gby=[t1_id@0 as alias1], aggr=[] 06)----------CoalesceBatchesExec: target_batch_size=2 -07)------------HashJoinExec: mode=Partitioned, join_type=Inner, on=[(t1_id@0, t2_id@0)], projection=[t1_id@0] +07)------------HashJoinExec: mode=Partitioned, join_type=Inner, on=[(t1_id@0, t2_id@0)] 08)--------------CoalesceBatchesExec: target_batch_size=2 09)----------------RepartitionExec: partitioning=Hash([t1_id@0], 2), input_partitions=2 10)------------------RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1 ``` In this case, the pushdown doesn't work, but there are other similar cases. If you set this config to true by default, you can use the completed mode of the sqllogictests to find the cases in whichs the physical plan has changed. ``` cargo test --test sqllogictests -- --complete ``` > The main question is: do you think this work is very specific, meaning the improvement should be added directly to the project, or do you believe other developers could benefit from it, so we should continue exploring the best way to integrate/add this to the DataFusion? Ideally, we should optimize the logical plans while retaining completeness, enabling us to unparse them back to SQL text easily—for example, by preserving `Projection`. Meanwhile, the physical planner can still generate an efficient physical plan with the necessary optimizations (e.g., Join with pushdown). If we can enable this configuration by default without altering any original physical plan, that would be the best-case scenario. However, achieving this would be a huge epic, with many issues needing resolution before enabling it by default 🤔. One compromise is what I mentioned previously in https://github.com/apache/datafusion/pull/13267#issuecomment-2466215869. We could introduce a configuration (or simply rename `optimize_projections_preserve_existing_projections`) to switch the optimization to an unparse-friendly mode. Additionally, we should document that users should avoid executing SQL with this configuration enabled, as some SQL queries may fail in this mode. https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/same_column_name_cross_join.slt -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org