[
https://issues.apache.org/jira/browse/IMPALA-12006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711259#comment-17711259
]
ASF subversion and git services commented on IMPALA-12006:
----------------------------------------------------------
Commit 901ce8ac4fd60daaa81ded2d6238163dab0c2a30 in impala's branch
refs/heads/master from Aman Sinha
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=901ce8ac4 ]
IMPALA-12006: Improve cardinality estimation for joins involving multiple
conjuncts
When an inner or outer join involves conjuncts such as the following:
SELECT * FROM t1 inner join
(SELECT a2, MAX(b2) as max_b2 FROM t2 GROUP BY a2) s2
ON t1.a1 = s2.a2 AND t1.b1 = s2.max_b2
the stats for the second conjunct involving the MAX expression gets
added to the 'otherEqJoinStats' list. These stats were being used
only when no other equijoin conjunct (involving base columns) was
present but not for the above case, thus leading to over-estimation.
The main change in this patch is an improvement to the cardinality
estimate for such cases by considering the conjuncts in the
'otherEqJoinStats list' in combination with the equijoin conjuncts.
A second change is the addition of a new query option which is a
floating point number between 0 to 1 inclusive:
join_selectivity_correlation_factor
It defaults to 0.0 which preserves the existing behavior of using
the Minimum selectivity of the conjuncts. Given multiple
join conjuncts C1,C2..Cn having comparable selectivity, we sometimes
see over-estimation by several orders of magnitude. Setting the above
query option to a value higher than 0 first computes the product of the
selectivities: sel(C1) * sel(C2) .. sel(Cn) and then scales it by
dividing by the join_selectivity_correlation_factor.
Note that this setting applies to all the joins in the query so it has
it's trade offs. Another approach is to use per join hints but there
are limitations of the hints approach too.
Testing:
- Added planner tests with a combination of outer and inner join with
different values of join_selectivity_correlation_factor
- Ran PlannerTest and TpcdsPlannerTest
Change-Id: I845d778a58404af834f7501fc8157a5a4b4bcc35
Reviewed-on: http://gerrit.cloudera.org:8080/19682
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Kurt Deschler <[email protected]>
Reviewed-by: Quanlong Huang <[email protected]>
> Outer/inner join cardinality highly overestimated
> -------------------------------------------------
>
> Key: IMPALA-12006
> URL: https://issues.apache.org/jira/browse/IMPALA-12006
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 4.2.0
> Reporter: Aman Sinha
> Assignee: Aman Sinha
> Priority: Major
>
> In one of the use cases, we have seen the cardinality estimate for left outer
> join highly overestimated. The plan is complex and only a partial output is
> shown below (with the column names anonymized):
> {noformat}
> 57:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
> | hash-table-id=121
> | hash predicates: a.id = a.id
> | fk/pk conjuncts: none
> | mem-estimate=0B mem-reservation=0B spill-buffer=2.00MB
> thread-reservation=0
> | tuple-ids=4N,3,6N,8N,9N,11N,14N,16N,19N,21N,24N,26N,29N,31N
> row-size=2.63KB cardinality=3.90T
> | in pipelines: 06(GETNEXT), 26(OPEN)
> |
> |--F1253:PLAN FRAGMENT hosts=13 instances=13
> | | Per-Instance Resources: mem-estimate=1.10GB mem-reservation=204.00MB
> thread-reservation=1
> | JOIN BUILD
> | | join-table-id=121 plan-id=122 cohort-id=25
> | | build expressions: a.id
> | | mem-estimate=1.08GB mem-reservation=204.00MB spill-buffer=2.00MB
> thread-reservation=0
> | |
> | 1758:EXCHANGE [BROADCAST]
> | | mem-estimate=20.87MB mem-reservation=0B thread-reservation=0
> | | tuple-ids=29,31 row-size=85B cardinality=9.56M
> | | in pipelines: 26(GETNEXT)
> | |
> ...
> ...
> ...
> 56:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
> | hash predicates: ifnull(a.id, a.id) = a.id
> | fk/pk conjuncts: assumed fk/pk
> | mem-estimate=0B mem-reservation=0B spill-buffer=2.00MB
> thread-reservation=0
> | tuple-ids=4N,3,6N,8N,9N,11N,14N,16N,19N,21N,24N,26N row-size=2.55KB
> cardinality=14.97G
> | in pipelines: 06(GETNEXT), 22(OPEN)
> {noformat}
> Note that the left input of the join is estimated as 14.97G rows, right input
> as 9.56M rows but the LOJ estimate is 3.9T rows. We need to investigate why
> that is so and fix it. The NDV of the based column involved in the join is
> 36661 but in the lower join there are functions involved in the join
> condition.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]