Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

via GitHub Thu, 26 Mar 2026 04:47:36 -0700


ydgandhi commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4133967253


   You’re right that injectivity isn’t the right lens here. This rewrite 
doesn’t assume (f) is injective. We intentionally rewrite to GROUP BY g, f(a) 
(not GROUP BY g, a) so cases like lower('Abc') and lower('aBC') correctly 
collapse to one distinct bucket.
   
   The current lower/upper/cast allowlist is a conservative scope guard, not a 
correctness requirement: it keeps the first version (default off) limited to a 
small set of well-understood, deterministic expression shapes we have 
tests/benchmarks for, while we gather more performance + edge-case coverage 
(e.g. other scalar functions, float/NaN semantics, dictionary types, etc.). We 
already block volatile expressions, but “any non-volatile expr” still opens a 
large surface area.
   
   > I have one minor question. In the scenario of rewriting "g, count(distinct 
lower(a))" to "group by g, lower(a)", why restrict it to only "lower", "upper", 
and "cast"? I don’t think any restriction is needed here. If the intention is 
to express an injective function, then "lower" is not one. If it is an 
injective function, the rewrite could simply be "group by g, a". Could you 
please clarify the design rationale behind this?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] perf: default multi COUNT(DISTINCT) logical optimizer rewrite [datafusion]

Reply via email to