goldmedal commented on PR #13267:
URL: https://github.com/apache/datafusion/pull/13267#issuecomment-2473934915

   > @goldmedal - does the "Push-down doesn’t work" part mean that projections 
are not pushed correctly (does not actually work) or something else? Could you 
please add more clarity on this. Going to test with `sqllogictests` as well - 
thank you for pointing to this.
   > 
   
   For example, the test in `joins.slt`.
   
https://github.com/apache/datafusion/blob/4e1f8391d4c2736ed92e5acba16dab51fb053aed/datafusion/sqllogictest/test_files/joins.slt#L1418-L1422
   After enabling `optimize_projections_preserve_existing_projections`, the 
projection for `HashJoinExec` will be removed.
   
   ```diff
   @@ -1425,10 +1433,9 @@ logical_plan
    01)Projection: count(alias1) AS count(DISTINCT join_t1.t1_id)
    02)--Aggregate: groupBy=[[]], aggr=[[count(alias1)]]
    03)----Aggregate: groupBy=[[join_t1.t1_id AS alias1]], aggr=[[]]
   -04)------Projection: join_t1.t1_id
   -05)--------Inner Join: join_t1.t1_id = join_t2.t2_id
   -06)----------TableScan: join_t1 projection=[t1_id]
   -07)----------TableScan: join_t2 projection=[t2_id]
   +04)------Inner Join: join_t1.t1_id = join_t2.t2_id
   +05)--------TableScan: join_t1 projection=[t1_id]
   +06)--------TableScan: join_t2 projection=[t2_id]
    physical_plan
    01)ProjectionExec: expr=[count(alias1)@0 as count(DISTINCT join_t1.t1_id)]
    02)--AggregateExec: mode=Final, gby=[], aggr=[count(alias1)]
   @@ -1436,7 +1443,7 @@ physical_plan
    04)------AggregateExec: mode=Partial, gby=[], aggr=[count(alias1)]
    05)--------AggregateExec: mode=SinglePartitioned, gby=[t1_id@0 as alias1], 
aggr=[]
    06)----------CoalesceBatchesExec: target_batch_size=2
   -07)------------HashJoinExec: mode=Partitioned, join_type=Inner, 
on=[(t1_id@0, t2_id@0)], projection=[t1_id@0]
   +07)------------HashJoinExec: mode=Partitioned, join_type=Inner, 
on=[(t1_id@0, t2_id@0)]
    08)--------------CoalesceBatchesExec: target_batch_size=2
    09)----------------RepartitionExec: partitioning=Hash([t1_id@0], 2), 
input_partitions=2
    10)------------------RepartitionExec: partitioning=RoundRobinBatch(2), 
input_partitions=1
   ```
   In this case, the pushdown doesn't work, but there are other similar cases. 
If you set this config to true by default, you can use the completed mode of 
the sqllogictests to find the cases in whichs the physical plan has changed.
   ```
   cargo test --test sqllogictests -- --complete
   ```
   
   
   > The main question is: do you think this work is very specific, meaning the 
improvement should be added directly to the project, or do you believe other 
developers could benefit from it, so we should continue exploring the best way 
to integrate/add this to the DataFusion?
   
   Ideally, we should optimize the logical plans while retaining completeness, 
enabling us to unparse them back to SQL text easily—for example, by preserving 
`Projection`. Meanwhile, the physical planner can still generate an efficient 
physical plan with the necessary optimizations (e.g., Join with pushdown). If 
we can enable this configuration by default without altering any original 
physical plan, that would be the best-case scenario.
   
   However, achieving this would be a huge epic, with many issues needing 
resolution before enabling it by default 🤔. 
   
   One compromise is what I mentioned previously in 
https://github.com/apache/datafusion/pull/13267#issuecomment-2466215869. We 
could introduce a configuration (or simply rename 
`optimize_projections_preserve_existing_projections`) to switch the 
optimization to an unparse-friendly mode. Additionally, we should document that 
users should avoid executing SQL with this configuration enabled, as some SQL 
queries may fail in this mode.
   
   
https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/same_column_name_cross_join.slt


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to