Quanlong Huang created IMPALA-15091:
---------------------------------------
Summary: Uses HBO AggregationNode Cardinality in Distinct
Semi-Join Optimization
Key: IMPALA-15091
URL: https://issues.apache.org/jira/browse/IMPALA-15091
Project: IMPALA
Issue Type: Improvement
Components: Frontend
Reporter: Quanlong Huang
Assignee: Quanlong Huang
When deciding whether to add a distinct aggregation on the inner side of
semi/anti joins (IMPALA-1270), the planner compares the estimated number of
distinct groups with the join cardinality:
{code:java}
long numDistinct = AggregationNode.estimateNumGroups(
distinctExprs, joinInput.getCardinality(), joinInput, analyzer);
...
if (joinInput.getCardinality() <= 1 ||
numDistinct > JOIN_DISTINCT_THRESHOLD * joinInput.getCardinality())
{{code}
[https://github.com/apache/impala/blob/670872bc8b868ef6a6f08d524603f99c16cfb061/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L2168-L2184]
Using HBO stats, the join cardinality could be accurate. Comparing the
estimated ndv with an accurate join cardinality could lead to wrong decisions.
We can use AggregationNode cardinalities stored in HBO stats to replace the
estimation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)