Re: [I] Add configurable UNION DISTINCT support to FILTER rewrite optimization [datafusion]

via GitHub Fri, 03 Apr 2026 09:04:40 -0700


xiedeyantu commented on issue #21310:
URL: https://github.com/apache/datafusion/issues/21310#issuecomment-4183978004


   > A correctness subtlety worth noting: the UNION DISTINCT to OR-filter 
rewrite is only valid when the projected column sets are identical and come 
from the same source. But even then, the DISTINCT semantics introduce a nuance.
   > 
   > Consider: `SELECT a FROM t WHERE a=1 UNION SELECT a FROM t WHERE a=1`. The 
original produces one row (dedup removes the duplicate). The naive rewrite 
`SELECT DISTINCT a FROM t WHERE a=1 OR a=1` also produces one row -- correct. 
But what about:
   > 
   > SELECT a, b FROM t WHERE a=1  -- might return (1, 'x'), (1, 'y')
   > UNION
   > SELECT a, b FROM t WHERE a=2  -- might return (2, 'z')
   > Rewrite: `SELECT DISTINCT a, b FROM t WHERE a=1 OR a=2` -- this is 
semantically equivalent and correct.
   > 
   > However, the rule should verify there are no correlated subqueries or 
non-deterministic functions (like `random()`, `now()`) in the SELECT list, 
since evaluating them once vs. twice can produce different results.
   > 
   > The cost model should also check that `OR`-combining the predicates 
doesn't defeat index utilization. If each branch can use an index seek but the 
OR'd predicate forces a full scan, the "optimization" could regress. A 
cardinality-aware check -- only rewrite when the combined selectivity suggests 
a scan is cheaper than multiple index seeks -- would make this robust.
   
   @Whatsonyourmind Thank you for your feedback! These key points are truly 
valuable! I'll try to answer your questions. Regarding the non-deterministic 
expressions and subqueries issues you mentioned, I partially addressed them in 
the PR for cases involving FILTER clauses, but I overlooked PROJECT scenarios, 
which need to be reviewed again for this context. Another concern is whether 
optimization will be beneficial in all cases. My understanding is that it will 
be beneficial in most scenarios, and I can't think of a situation where it 
would degrade performance. Let me know if my understanding is correct.
   
   Given that DataFusion is positioned as a query engine rather than a 
database, it is most commonly used for querying Parquet files or data lake 
files. These types of data are typically sorted by one or more columns as sort 
keys. Therefore, in general, it is rare for both subclause1 (WHERE a=1) and 
subclause2 (WHERE b=2) in a UNION clause to benefit from indexing. If 
subclause1 can leverage an index, it's unlikely that subclause2 will. Even if 
column 'a' is the physical sort key of the data, and subclause2 is changed to 
WHERE a=2, queries like a=1 OR a=2 can still benefit from data pruning. Please 
correct me if I have misunderstood anything.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Add configurable UNION DISTINCT support to FILTER rewrite optimization [datafusion]

Reply via email to