Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

via GitHub Thu, 26 Mar 2026 05:00:35 -0700


ydgandhi commented on code in PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#discussion_r2994387765



##########
docs/source/user-guide/configs.md:
##########
@@ -144,6 +144,7 @@ The following configuration settings are available:
 | datafusion.optimizer.enable_aggregate_dynamic_filter_pushdown | true | When set to true, the optimizer will attempt to push down Aggregate dynamic filters into the file scan phase. |
 | datafusion.optimizer.enable_dynamic_filter_pushdown | true | When set to true attempts to push down dynamic filters generated by operators (TopK, Join & Aggregate) into the file scan phase. For example, for a query such as `SELECT * FROM t ORDER BY timestamp DESC LIMIT 10`, the optimizer will attempt to push down the current top 10 timestamps that the TopK operator references into the file scans. This means that if we already have 10 timestamps in the year 2025 any files that only have timestamps in the year 2024 can be skipped / pruned at various stages in the scan. The config will suppress `enable_join_dynamic_filter_pushdown`, `enable_topk_dynamic_filter_pushdown` & `enable_aggregate_dynamic_filter_pushdown` So if you disable `enable_topk_dynamic_filter_pushdown`, then enable `enable_dynamic_filter_pushdown`, the `enable_topk_dynamic_filter_pushdown` will be overridden. |
 | datafusion.optimizer.filter_null_join_keys | false | When set to true, the optimizer will insert filters before a join between a nullable and non-nullable column to filter out nulls on the nullable side. This filter can add additional overhead when the file format does not fully support predicate push down. |
+| datafusion.optimizer.enable_multi_distinct_count_rewrite | false | When set to true, the optimizer may rewrite a single aggregate with multiple `COUNT(DISTINCT …)` (with `GROUP BY`) into joins of per-distinct sub-aggregates. This can reduce peak memory but adds join work; default off until benchmarks support enabling broadly. |

Review Comment:
   I have made small changes to this in my latest commit. Hope this helps.
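
The rewrite described in the new `enable_multi_distinct_count_rewrite` row can be illustrated with a small sketch. It is only an illustration of the query shape, not the optimizer rule from this PR: it uses Python's built-in `sqlite3` rather than DataFusion, and the table `t` with columns `g`, `a`, and `b` is hypothetical. The group key is declared `NOT NULL` here because an equality join would not match NULL keys; how the actual rule handles that case is not shown in this message.

```python
import sqlite3

# Hypothetical table: g is the GROUP BY key, a and b are the DISTINCT-counted columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (g TEXT NOT NULL, a INTEGER, b INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("x", 1, 10), ("x", 1, 20), ("x", 2, 20), ("y", 3, None), ("y", 3, 30)],
)

# Original shape: one aggregate carrying several COUNT(DISTINCT ...) expressions.
original = """
SELECT g, COUNT(DISTINCT a) AS da, COUNT(DISTINCT b) AS db
FROM t
GROUP BY g
ORDER BY g
"""

# Rewritten shape: one sub-aggregate per distinct column, joined on the group key.
rewritten = """
SELECT sa.g, sa.da, sb.db
FROM (SELECT g, COUNT(DISTINCT a) AS da FROM t GROUP BY g) AS sa
JOIN (SELECT g, COUNT(DISTINCT b) AS db FROM t GROUP BY g) AS sb
  ON sa.g = sb.g
ORDER BY sa.g
"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()

# Both query shapes should return the same per-group distinct counts.
print(original_rows)
print(rewritten_rows)
```

Each sub-aggregate only has to track distinct values for one column at a time, which is the intuition behind the "reduce peak memory but adds join work" trade-off in the description.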



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

Reply via email to