Quanlong Huang created IMPALA-15091:
---------------------------------------

             Summary: Uses HBO AggregationNode Cardinality in Distinct 
Semi-Join Optimization
                 Key: IMPALA-15091
                 URL: https://issues.apache.org/jira/browse/IMPALA-15091
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang


When deciding whether to add a distinct aggregation on the inner side of 
semi/anti joins (IMPALA-1270), the planner compares the estimated number of 
distinct groups with the join cardinality:
{code:java}
    long numDistinct = AggregationNode.estimateNumGroups(
        distinctExprs, joinInput.getCardinality(), joinInput, analyzer);
    ...
    if (joinInput.getCardinality() <= 1 ||
        numDistinct > JOIN_DISTINCT_THRESHOLD * joinInput.getCardinality()) 
{{code}
[https://github.com/apache/impala/blob/670872bc8b868ef6a6f08d524603f99c16cfb061/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java#L2168-L2184]

Using HBO stats, the join cardinality could be accurate. Comparing the 
estimated ndv with an accurate join cardinality could lead to wrong decisions. 
We can use AggregationNode cardinalities stored in HBO stats to replace the 
estimation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to