Paul Rogers created IMPALA-8018:
-----------------------------------

             Summary: Unify cardinality math for PK/FK and generic cases
                 Key: IMPALA-8018
                 URL: https://issues.apache.org/jira/browse/IMPALA-8018
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers


IMPALA-8014 discussed the math for FK/PK (M:1) cardinality estimation (and a 
bug in that logic.) IMPALA-8015 similarly discussed the math for generic (M:N) 
cardinality estimation (and a bug.)

The math in those two tickets suggests that the code is a bit overly complex: 
that there are not, in fact, two cases ("PK/FK" and "generic") but just one, at 
least for cardinality estimation.

Consider the generic formula worked out in IMPALA-8015:

{noformat}
p(match) = ndv(r.b) / ndv(l.a) if ndv(r.b) < ndv(l.a),
           ndv(l.a) / ndv(r.b) otherwise
|l >< r| = |scan(l)| * |scan(r)| / ndv(r.b) * p(match)
{noformat}

Intuitively, each row from the left (probe) side matches a group of rows on the 
right hand (build ) side. How big is that group? It is the total of right-hand 
rows divided by the ndv of the right hand key (assuming uniform key 
distribution). That is:

{noformat}
|r 𝜎 r.b=x| = |r| / ndv(r.b)
{noformat}

A query typically applies filters to the right table. Assuming a uniform 
reduction via a non-key filter:

{noformat}
|scan(r 𝜎 r.b=x)| = |scan(r)| / ndv(r.b)
{noformat}

The formula is the "the cardinality of a selection of table r where r.b equals 
some key value x". That's a mouthful, so let's abbreviate:

{noformat}
|r.b|       = |r| / ndv(r.b)
|scan(r.b)| = |scan(r)| / ndv(r.b)
{noformat}

One thing to notice is that the same NDV value appears in both cases. Applying 
a filter should reduce the NDV. But, we are concerned about the M:N case, so 
the right-hand filter simply reduces the size of each group, but not the number 
of groups. Of course, if filtering is severe enough, it will eliminate entire 
groups. This shows up because the ratio `|scan(r)| / ndv(r.b)` will become 
fractional and represent the probability that any member of the group is 
available for join. This means that the single formula works for all cases: 
full table, partial table and very heavy selection.

We can now rewrite the generic-case equation from IMPALA-8015:

{noformat}
|l >< r| = |scan(l)| * |scan(r)| / ndv(r.b) * p(match)
         = |scan(l)| * |scan(r.b)| * p(match)
{noformat}

The probability term handles the case that there are more keys on one side than 
the other so that some rows won't find a match. We assume that if the same 
number of keys exist on both sides, that they are the same keys. (If not, the 
user would likely not have bothered to create a join predicate that equates 
them.)

What happens as the number of right-hand keys increases? The group size 
decreases. The largest that the {{ndv(r.b)}} can get is {{|r|}}, so in the 
limit the group size goes to 1.

Let's assume we know from metadata that {{ndv(r.b) = |r|}}, then we have a 
group size of 1 and we can ignore the group size term, so:

{noformat}
|l >< r| = |scan(l)| * |scan(r.b)| * p(match)

|r.b| = 1

|l >< r| = |scan(l)| * |scan(r.b)| * p(match)
         = |scan(l)| * 1 * p(match)
         = |scan(l)| * p(match)
{noformat}

Now, let's recall the formula used in the 1:M ("PK/FK") case:

{noformat}
|d >< m| = |d| * p(match)
{noformat}

They are identical.

But, what about that {{p(match)}} term, perhaps that is different in the two 
cases. In the one case, it tells us a M:N probability, in the other it tells us 
the M:1 probability. Let's check:

{noformat}
p(match) = ndv(d.fk) / ndv(m.pk) if ndv(d.fk) < ndv(m.pk),
           ndv(m.pk) / ndv(d.fk) otherwise.
p(match) = ndv(r.b) / ndv(l.a) if ndv(r.b) < ndv(l.a),
           ndv(l.a) / ndv(r.b) otherwise
{noformat}

Except for a difference in names, the formulas are identical. This means that 
the M:1 case is automatically handled by the M:N case, with 1 simply a 
degenerate value for N.

This, in turn, means that Impala need not implement two cardinality estimation 
paths; a single path will suffice.

Let's double check.

How does the code pick the M:1 path? By considering if {{ndv(r.b) = |r|}}. If 
so, then {{r}} is actually {{m}} (a master table and {{l}} is {{d}} (a detail 
table) and we have the FK/PK case.

On the other hand, if {{ndv(r.b) < |r|}}, we have a M:N case, each row in {{l}} 
matches multiple rows in {{r}} called the generic case in Impala.

But, what if {{ndv(r.b)}} is less then {{|r|}} by a small amount, say 10%? That 
means each row in {{l}} will match mostly one row, but about 10% of the time it 
will match two. So, the generic formula for this case should be close to the 
FK/PK formula. What if the difference is 5% or 1%? The generic formula should 
match even closer to the FK/PK formula. Finally when the difference is 0%, the 
generic formula must converge with the FK/PK formula.

Said another way, as the M:N groups on the "N" side get ever smaller, they will 
converge on a group size of 1, which is the M:1 case.

Given all this, the M:1 (FK/PK) formula *must* be a special case of the more 
generic M:N (generic) formula.

The consequence is that the code should implement the math once, not twice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to