Whatsonyourmind commented on issue #21310: URL: https://github.com/apache/datafusion/issues/21310#issuecomment-4182732012
A correctness subtlety worth noting: the UNION DISTINCT to OR-filter rewrite is only valid when the projected column sets are identical and come from the same source. But even then, the DISTINCT semantics introduce a nuance. Consider: `SELECT a FROM t WHERE a=1 UNION SELECT a FROM t WHERE a=1`. The original produces one row (dedup removes the duplicate). The naive rewrite `SELECT DISTINCT a FROM t WHERE a=1 OR a=1` also produces one row -- correct. But what about: ```sql SELECT a, b FROM t WHERE a=1 -- might return (1, 'x'), (1, 'y') UNION SELECT a, b FROM t WHERE a=2 -- might return (2, 'z') ``` Rewrite: `SELECT DISTINCT a, b FROM t WHERE a=1 OR a=2` -- this is semantically equivalent and correct. However, the rule should verify there are no correlated subqueries or non-deterministic functions (like `random()`, `now()`) in the SELECT list, since evaluating them once vs. twice can produce different results. The cost model should also check that `OR`-combining the predicates doesn't defeat index utilization. If each branch can use an index seek but the OR'd predicate forces a full scan, the "optimization" could regress. A cardinality-aware check -- only rewrite when the combined selectivity suggests a scan is cheaper than multiple index seeks -- would make this robust. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
