[
https://issues.apache.org/jira/browse/IMPALA-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626577#comment-16626577
]
Paul Rogers edited comment on IMPALA-7564 at 9/25/18 12:19 AM:
---------------------------------------------------------------
Great description. I think we can tease apart some of the concerns.
First, let's define some terms. In a join, we have two tables, t1 and t2. Let
|t1| be the cardinality of t1, and so on.
The goal is to efficiently join all rows in t1 with all rows in t2 where some
predicate holds. (In the worst was of a cartesian join, the predicate can just
be {{true}}.)
The description discusses NDV. This is useful to predict the number of hash
table buckets on the build side, but is not a good predictor of the number of
_rows_ on the build side.
The description also mentions the primary key (PK). A characteristic of a PK in
an RDB is that the PK is unique, enforced by a unique index. We can detect that
we might have a PK in a join:
{noformat}
SELECT ... FROM t1, t2 WHERE t1.a = t1.b
{noformat}
If {{NDV(ti.c) = |ti|}} for some column {{c}} and i in (1, 2). In practice,
there might be some wiggle room, may be we use {{NDV(ti.c) >= 0.95 * |ti|}}.
On the other hand, we can detect a foreign key if {{NDV(tj.c) << |tj|}}. (Where
<< means much less than.)
Note that these considerations ignore any predicates in the query: they are
statements about the base tables themselves.
We can further cross-check. Let's say the above suggests that {{t1.a}} is the
PK. Then, t1 is a "master" table and t2 is a "detail" table. We can check this.
It should be that {{|t1| << |t2|}}.
If everything checks out, we can guess that {{t1}} should be on the build side
of the query. We can now apply predicates to see what part of {{t1}} will
actually be considered. (Perhaps {{t1}} is the {{customer}} table, only the
'CA' customers are considered.)
But, what happens if the sleuthing above turns out to give inconclusive
results? We can fall back on a more basic technique.
Compute the cardinality of the relation that results from applying all relevant
predicates to t1. Call this r1. Do the same for t2 to get r2. The general
rule-of-thumb is to consider {{|r1|}} and {{|r2|}} and put the relation with
the smaller cardinality on the build side.
(One could go further and calculate {{|r1| * width(r1)}} where {{width(r1)}} is
the average row width. One would do this if the total memory is more important
than just the total row count, and if we expect that some tables will be much
narrower than others.)
Now for the kicker. Suppose we do both techniques and we find that the PK/FK
analysis gives one answer, the cardinality analysis gives the other. Which
would we pick?
Suppose t1 is the master table. But, after predicates {{|r1| >> |r2|}}. Would
we still pick t1 for the build side? How much smaller would {{|r2|}} have to be
than {{|r1|}} for us to change our minds?
Consider an example. We join customers and orders. But, we only choose orders
for a five-minute period. That results in, say, 1K orders but we have 1M
customers. Should we reverse the probe/build sides?
If we give heavy weight to the effective relation size, then we don't even need
the PK/FK determination. If we give heavy weight to the PK/FK analysis, then we
will choose a suboptimal plan in extreme conditions.
All of this goes to say that this issue is not quite as cut-and-dried as it
seems. An RDB certainly does not always put the PK on the build side; it often
does the relation size analysis described above.
was (Author: paul.rogers):
Great description. I think we can tease apart some of the concerns.
First, let's define some terms. In a join, we have two tables, t1 and t2. Let
|t1| be the cardinality of t1, and so on.
The goal is to efficiently join all rows in t1 with all rows in t2 where some
predicate holds. (In the worst was of a cartesian join, the predicate can just
be {{true}}.)
The description discusses NDV. This is useful to predict the number of hash
table buckets on the build side, but is not a good predictor of the number of
_rows_ on the build side.
The description also mentions the primary key (PK). A characteristic of a PK in
an RDB is that the PK is unique, enforced by a unique index. We can detect that
we might have a PK in a join:
{noformat}
SELECT ... FROM t1, t2 WHERE t1.a = t1.b
{noformat}
If {{NDV(ti.a) = |ti|}} for i in (1, 2). In practice, there might be some
wiggle room, may be we use {{NDV(ti.a) >= 0.95 * |ti|}}.
On the other hand, we can detect a foreign key if {{NDV(tj.a) << |tjj|}}.
(Where << means much less than.)
Note that these considerations ignore any predicates in the query: they are
statements about the base tables themselves.
We can further cross-check. Let's say the above suggests that {{t1.a}} is the
PK. Then, t1 is a "master" table and t2 is a "detail" table. We can check this.
It should be that {{|t1| << |t2|}}.
If everything checks out, we can guess that {{t1}} should be on the build side
of the query. We can now apply predicates to see what part of {{t1}} will
actually be considered. (Perhaps I{{t1}} is the {{customer}} table, only the
'CA' customers are considered.)
But, what happens if the sleuthing above turns out to give inconclusive
results? We can fall back on a more basic technique.
Compute the cardinality of the relation that results from applying all relevant
predicates to t1. Call this r1. Do the same for t2 to get r2. The general
rule-of-thumb is to consider {{|r1|}} and {{|r2|}} and put the relation with
the smaller cardinality on the build side.
(One could go further and calculate {{|r1| * width(r1)}} where {{width(r1)}} is
the average row width. One would do this if the total memory is more important
than just the total row count, and if we expect that some tables will be much
narrower than others.)
Now for the kicker. Suppose we do both techniques and we find that the PK/FK
analysis gives one answer, the cardinality analysis gives the other. Which
would we pick?
Suppose t1 is the master table. But, after predicates {{|r1| >> |r2|}}. Would
we still pick t1 for the build side? How much smaller would {{|r2|}} have to be
than {{|r1|}} for us to change our minds?
Consider an example. We join customers and orders. But, we only choose orders
for a five-minute period. That results in, say, 1K orders but we have 1M
customers. Should we reverse the probe/build sides?
If we give heavy weight to the effective relation size, then we don't even need
the PK/FK determination. If we give heavy weight to the PK/FK analysis, then we
will choose a suboptimal plan in extreme conditions.
All of this goes to say that this issue is not quite as cut-and-dried as it
seems. An RDB certainly does not always put the PK on the build side; it often
does the relation size analysis described above.
> Conservative FK/PK join type detection with complex equi-join conjuncts
> -----------------------------------------------------------------------
>
> Key: IMPALA-7564
> URL: https://issues.apache.org/jira/browse/IMPALA-7564
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 2.12.0, Impala 2.13.0, Impala 3.1.0
> Reporter: bharath v
> Priority: Major
>
> With IMPALA-5547, we predict whether a join is an FK/PK join as follows.
> {noformat}
> // Iterate over all groups of conjuncts that belong to the same joined tuple
> id pair.
> // For each group, we compute the join NDV of the rhs slots and compare
> it to the
> // number of rows in the rhs table.
> for (List<EqJoinConjunctScanSlots> fkPkCandidate:
> scanSlotsByJoinedTids.values()) {
> double jointNdv = 1.0;
> for (EqJoinConjunctScanSlots slots: fkPkCandidate) jointNdv *=
> slots.rhsNdv();
> double rhsNumRows = fkPkCandidate.get(0).rhsNumRows();
> if (jointNdv >= Math.round(rhsNumRows * (1.0 -
> FK_PK_MAX_STATS_DELTA_PERC))) {
> // We cannot disprove that the RHS is a PK.
> if (result == null) result = Lists.newArrayList();
> result.addAll(fkPkCandidate);
> }
> }
> {noformat}
> We iterate through all the "simple" equi join conjuncts on the RHS, multiply
> their NDVs and check if it close to rhsNumRows. The issue here is that this
> can result in conservative FK/Pk detection if the equi-join conjuncts are not
> simple (of the form <slotRef> = <slotRef>)
> {noformat}
> /**
> * Returns a new EqJoinConjunctScanSlots for the given equi-join conjunct
> or null if
> * the given conjunct is not of the form <SlotRef> = <SlotRef> or if the
> underlying
> * table/column of at least one side is missing stats.
> */
> public static EqJoinConjunctScanSlots create(Expr eqJoinConjunct) {
> if (!Expr.IS_EQ_BINARY_PREDICATE.apply(eqJoinConjunct)) return null;
> SlotDescriptor lhsScanSlot =
> eqJoinConjunct.getChild(0).findSrcScanSlot();
> if (lhsScanSlot == null || !hasNumRowsAndNdvStats(lhsScanSlot)) return
> null;
> SlotDescriptor rhsScanSlot =
> eqJoinConjunct.getChild(1).findSrcScanSlot();
> {noformat}
> For example, the following query contains a complex equi-join conjunct
> {{substr(l.c3, 1, 6) = substr(r.c3, 1,6)}}, so while detecting if the left
> outer join is an FK/PK, we just check if
> {{NDVs(r.c1) * NDVs(r.c2) ~ r.numRows()}} which is incorrect. (This happens
> because EqJoinConjunctScanSlots.create() returns null for any non-simple
> predicates which are not considered later).
> {noformat}
> [localhost:21000]> explain select * from test_left l left outer join
> test_right r on l.c1 = r.c1 and l.c2 = r.c2 and substr(l.c3, 1, 6) =
> substr(r.c3, 1,6);
> Query: explain select * from test_left l left outer join test_right r on l.c1
> = r.c1 and l.c2 = r.c2 and substr(l.c3, 1, 6) = substr(r.c3, 1,6)
> +-----------------------------------------------------------------------------------------+
> | Explain String
> |
> +-----------------------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=1.95MB Threads=5
> |
> | Per-Host Resource Estimates: Memory=66MB
> |
> |
> |
> | F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |
> | | Per-Host Resources: mem-estimate=0B mem-reservation=0B
> thread-reservation=1 |
> | PLAN-ROOT SINK
> |
> | | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> | |
> |
> | 04:EXCHANGE [UNPARTITIONED]
> |
> | | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> | | tuple-ids=0,1N row-size=94B cardinality=49334767023
> |
> | | in pipelines: 00(GETNEXT)
> |
> | |
> |
> | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> |
> | Per-Host Resources: mem-estimate=33.94MB mem-reservation=1.95MB
> thread-reservation=2 |
> | 02:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
> |
> | | hash predicates: l.c1 = r.c1, l.c2 = r.c2, substr(l.c3, 1, 6) =
> substr(r.c3, 1, 6) |
> | | fk/pk conjuncts: none
> |
> | | mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB
> thread-reservation=0 |
> | | tuple-ids=0,1N row-size=94B cardinality=49334767023
> |
> | | in pipelines: 00(GETNEXT), 01(OPEN)
> |
> | |
> |
> | |--03:EXCHANGE [BROADCAST]
> |
> | | | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> | | | tuple-ids=1 row-size=47B cardinality=2510
> |
> | | | in pipelines: 01(GETNEXT)
> |
> | | |
> |
> | | F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> |
> | | Per-Host Resources: mem-estimate=32.00MB mem-reservation=8.00KB
> thread-reservation=2 |
> | | 01:SCAN HDFS [cdh72724.test_right r, RANDOM]
> |
> | | partitions=1/1 files=1 size=8B
> |
> | | stored statistics:
> |
> | | table: rows=2510 size=8B
> |
> | | columns: all
> |
> | | extrapolated-rows=disabled max-scan-range-rows=2510
> |
> | | mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1
> |
> | | tuple-ids=1 row-size=47B cardinality=2510
> |
> | | in pipelines: 01(GETNEXT)
> |
> | |
> |
> | 00:SCAN HDFS [cdh72724.test_left l, RANDOM]
> |
> | partitions=1/1 files=1 size=8B
> |
> | stored statistics:
> |
> | table: rows=589658570 size=8B
> |
> | columns: all
> |
> | extrapolated-rows=disabled max-scan-range-rows=589658570
> |
> | mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1
> |
> | tuple-ids=0 row-size=47B cardinality=589658570
> |
> | in pipelines: 00(GETNEXT)
> |
> +-----------------------------------------------------------------------------------------+
> Fetched 48 row(s) in 0.02s
> {noformat}
> We should either consider NDVs of complex conjuncts (error-prone for obvious
> reasons) or generally assume a join to be FK/PK until and unless proven
> otherwise.
> Here is the plan if the complex conjunct is relaxed to a simple <slotRef> =
> <slotRef> type. We can see that it detects the join to be of fk/pk type.
> {noformat}
> [localhost:21000] cdh72724> explain select * from test_left l left outer join
> test_right r on l.c1 = r.c1 and l.c2 = r.c2 and l.c3 = r.c3;
> Query: explain select * from test_left l left outer join test_right r on l.c1
> = r.c1 and l.c2 = r.c2 and l.c3 = r.c3
> +-----------------------------------------------------------------------------------------+
> | Explain String
> |
> +-----------------------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=1.95MB Threads=5
> |
> | Per-Host Resource Estimates: Memory=66MB
> |
> |
> |
> | F02:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
> |
> | | Per-Host Resources: mem-estimate=0B mem-reservation=0B
> thread-reservation=1 |
> | PLAN-ROOT SINK
> |
> | | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> | |
> |
> | 04:EXCHANGE [UNPARTITIONED]
> |
> | | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> | | tuple-ids=0,1N row-size=94B cardinality=589658570
> |
> | | in pipelines: 00(GETNEXT)
> |
> | |
> |
> | F00:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> |
> | Per-Host Resources: mem-estimate=33.94MB mem-reservation=1.95MB
> thread-reservation=2 |
> | 02:HASH JOIN [LEFT OUTER JOIN, BROADCAST]
> |
> | | hash predicates: l.c1 = r.c1, l.c2 = r.c2, l.c3 = r.c3
> |
> | | fk/pk conjuncts: l.c1 = r.c1, l.c2 = r.c2, l.c3 = r.c3
> |
> | | mem-estimate=1.94MB mem-reservation=1.94MB spill-buffer=64.00KB
> thread-reservation=0 |
> | | tuple-ids=0,1N row-size=94B cardinality=589658570
> |
> | | in pipelines: 00(GETNEXT), 01(OPEN)
> |
> | |
> |
> | |--03:EXCHANGE [BROADCAST]
> |
> | | | mem-estimate=0B mem-reservation=0B thread-reservation=0
> |
> | | | tuple-ids=1 row-size=47B cardinality=2510
> |
> | | | in pipelines: 01(GETNEXT)
> |
> | | |
> |
> | | F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
> |
> | | Per-Host Resources: mem-estimate=32.00MB mem-reservation=8.00KB
> thread-reservation=2 |
> | | 01:SCAN HDFS [cdh72724.test_right r, RANDOM]
> |
> | | partitions=1/1 files=1 size=8B
> |
> | | stored statistics:
> |
> | | table: rows=2510 size=8B
> |
> | | columns: all
> |
> | | extrapolated-rows=disabled max-scan-range-rows=2510
> |
> | | mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1
> |
> | | tuple-ids=1 row-size=47B cardinality=2510
> |
> | | in pipelines: 01(GETNEXT)
> |
> | |
> |
> | 00:SCAN HDFS [cdh72724.test_left l, RANDOM]
> |
> | partitions=1/1 files=1 size=8B
> |
> | stored statistics:
> |
> | table: rows=589658570 size=8B
> |
> | columns: all
> |
> | extrapolated-rows=disabled max-scan-range-rows=589658570
> |
> | mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1
> |
> | tuple-ids=0 row-size=47B cardinality=589658570
> |
> | in pipelines: 00(GETNEXT)
> |
> +-----------------------------------------------------------------------------------------+
> Fetched 48 row(s) in 0.01s
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]