ydgandhi commented on PR #21088:
URL: https://github.com/apache/datafusion/pull/21088#issuecomment-4133967253
You’re right that injectivity isn’t the right lens here. This rewrite
doesn’t assume (f) is injective. We intentionally rewrite to GROUP BY g, f(a)
(not GROUP BY g, a) so cases like lower('Abc') and lower('aBC') correctly
collapse to one distinct bucket.
The current lower/upper/cast allowlist is a conservative scope guard, not a
correctness requirement: it keeps the first version (default off) limited to a
small set of well-understood, deterministic expression shapes we have
tests/benchmarks for, while we gather more performance + edge-case coverage
(e.g. other scalar functions, float/NaN semantics, dictionary types, etc.). We
already block volatile expressions, but “any non-volatile expr” still opens a
large surface area.
> I have one minor question. In the scenario of rewriting "g, count(distinct
lower(a))" to "group by g, lower(a)", why restrict it to only "lower", "upper",
and "cast"? I don’t think any restriction is needed here. If the intention is
to express an injective function, then "lower" is not one. If it is an
injective function, the rewrite could simply be "group by g, a". Could you
please clarify the design rationale behind this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]