xiedeyantu commented on issue #21310: URL: https://github.com/apache/datafusion/issues/21310#issuecomment-4183978004
> A correctness subtlety worth noting: the UNION DISTINCT to OR-filter rewrite is only valid when the projected column sets are identical and come from the same source. But even then, the DISTINCT semantics introduce a nuance. > > Consider: `SELECT a FROM t WHERE a=1 UNION SELECT a FROM t WHERE a=1`. The original produces one row (dedup removes the duplicate). The naive rewrite `SELECT DISTINCT a FROM t WHERE a=1 OR a=1` also produces one row -- correct. But what about: > > SELECT a, b FROM t WHERE a=1 -- might return (1, 'x'), (1, 'y') > UNION > SELECT a, b FROM t WHERE a=2 -- might return (2, 'z') > Rewrite: `SELECT DISTINCT a, b FROM t WHERE a=1 OR a=2` -- this is semantically equivalent and correct. > > However, the rule should verify there are no correlated subqueries or non-deterministic functions (like `random()`, `now()`) in the SELECT list, since evaluating them once vs. twice can produce different results. > > The cost model should also check that `OR`-combining the predicates doesn't defeat index utilization. If each branch can use an index seek but the OR'd predicate forces a full scan, the "optimization" could regress. A cardinality-aware check -- only rewrite when the combined selectivity suggests a scan is cheaper than multiple index seeks -- would make this robust. @Whatsonyourmind Thank you for your feedback! These key points are truly valuable! I'll try to answer your questions. Regarding the non-deterministic expressions and subqueries issues you mentioned, I partially addressed them in the PR for cases involving FILTER clauses, but I overlooked PROJECT scenarios, which need to be reviewed again for this context. Another concern is whether optimization will be beneficial in all cases. My understanding is that it will be beneficial in most scenarios, and I can't think of a situation where it would degrade performance. Let me know if my understanding is correct. Given that DataFusion is positioned as a query engine rather than a database, it is most commonly used for querying Parquet files or data lake files. These types of data are typically sorted by one or more columns as sort keys. Therefore, in general, it is rare for both subclause1 (WHERE a=1) and subclause2 (WHERE b=2) in a UNION clause to benefit from indexing. If subclause1 can leverage an index, it's unlikely that subclause2 will. Even if column 'a' is the physical sort key of the data, and subclause2 is changed to WHERE a=2, queries like a=1 OR a=2 can still benefit from data pruning. Please correct me if I have misunderstood anything. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
