Whatsonyourmind commented on issue #21310:
URL: https://github.com/apache/datafusion/issues/21310#issuecomment-4182732012

   A correctness subtlety worth noting: the UNION DISTINCT to OR-filter rewrite 
is only valid when the projected column sets are identical and come from the 
same source. But even then, the DISTINCT semantics introduce a nuance.
   
   Consider: `SELECT a FROM t WHERE a=1 UNION SELECT a FROM t WHERE a=1`. The 
original produces one row (dedup removes the duplicate). The naive rewrite 
`SELECT DISTINCT a FROM t WHERE a=1 OR a=1` also produces one row -- correct. 
But what about:
   
   ```sql
   SELECT a, b FROM t WHERE a=1  -- might return (1, 'x'), (1, 'y')
   UNION
   SELECT a, b FROM t WHERE a=2  -- might return (2, 'z')
   ```
   
   Rewrite: `SELECT DISTINCT a, b FROM t WHERE a=1 OR a=2` -- this is 
semantically equivalent and correct.
   
   However, the rule should verify there are no correlated subqueries or 
non-deterministic functions (like `random()`, `now()`) in the SELECT list, 
since evaluating them once vs. twice can produce different results.
   
   The cost model should also check that `OR`-combining the predicates doesn't 
defeat index utilization. If each branch can use an index seek but the OR'd 
predicate forces a full scan, the "optimization" could regress. A 
cardinality-aware check -- only rewrite when the combined selectivity suggests 
a scan is cheaper than multiple index seeks -- would make this robust.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to