[ 
https://issues.apache.org/jira/browse/IMPALA-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned IMPALA-8045:
-----------------------------------

    Assignee:     (was: Paul Rogers)

> Rollup of Smaller Join Cardinality Issues
> -----------------------------------------
>
>                 Key: IMPALA-8045
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8045
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> The work to review join cardinality has found some major issues recorded as 
> JIRA tickets. This ticket records a number of smaller issues. Some of these 
> issues are a bit tricky because they appear only when some of the other 
> issues are resolved. Reporting them directly could be misleading.
> h4. ScanNode confusion between table and scan input cardinality
> The {{ScanNode}} class in the scanner contains an {{inputCardinality_}} field 
> used by join calculations as a proxy for the table size. However, the actual 
> scan node implementations set the {{inputCardinality_}} to the estimated 
> number of rows *read* by the scan, which is useful when understanding the 
> physical scan structure. But, for joins, we need the base table cardinality.
> For example, the join may use the input cardinality to understand the 
> reduction in rows due to filters in order to adjust the NDV of key columns. 
> But, since the input cardinality is the scan count, not the table row count, 
> the math does not work out.
> The solution is to clarify the code to separate the idea of scan count vs. 
> base table row count.
> h4. Selectivity Confusion
> Similarly, each node computes its selectivity. However, the selectivity is 
> only for those predicates that will be applied via a projection. Predicates 
> that can be applied because of partition pruning (HDFS), key range pruning 
> (HBase) and so on do not "count". While this produces accurate execution 
> estimates, it is not helpful for join planning.
> In join planning, we need to know the number of filtered rows relative to the 
> total table cardinality. This allows us to adjust HDV key cardinality in 
> order to estimate the number of rows produced by the join.
> Using the partial selectivity, or partial input cardinality (above issue) 
> causes inaccurate key cardinality adjustments and incorrect join cardinality 
> estimates.
> h4. Join Node Does not Apply Selectivity from Its Predicates
> A join node can have "additional predicates" applied after creating a join 
> row. Accurate estimation of join cardinality must include the selectivity 
> from those predicates, but is not currently done. Perhaps because such 
> predicates, in the current estimation scheme, always produce an estimated 
> selectivity of .1. This will be more important as we add more realistic 
> estimates.
> h4. Use Double, not Long for Cardinality Values
> In scan nodes, row counts can be reasonable numbers and a Java {{long}} is 
> fine. But, once one starts computing join cardinalities, values can grow 
> fast, especially for cross joins. The code currently has special checks to 
> limit products to {{Long.MAX_VALUE}}. While this solves the overflow issue, 
> it has undesirable downstream affects. First, it throws of selectivity 
> calculations since the reported cardinality is not the real cardinality. 
> Second, it requires special math calls whenever we multiply cardinalities.
> Much simper to work with a {{double}}. When values get large, the extra 
> precision from a integer value is completely lost in the noise of assumptions 
> and estimations.
> h4. Revisit Cardinality Calcs for Join Nodes
> The method {{JoinNode.computeStats()}} is a bit muddled. It starts by 
> computing cardinality depending on the major type family (semi-join, 
> inner/outer join, cross join). It then revises those calcs based on the 
> specific join type. This makes it very hard to follow the logic case we have 
> to follow two distinct blocks of code. There is also redundancy. The cross 
> join cardinality is calculated twice, for example.
> Refactor to have a cardinality/selectivity calculation per join type.
> h4. Disallow Unknown Cardinality
> Multiple nodes can produce a cardinality of -1 (unknown). Since it is 
> impossible to plan based on an unknown cardinality, we must have an estimate, 
> however good or bad. For cases where we have no stats, estimate cardinality 
> based on other factors. If we have no column NDV, perhaps guesstimate 
> something, or use an alternative join calculation that avoids the need for 
> NDV (while producing much cruder estimates.) However, refusing to play the 
> game at all is not helpful unless we choose to fail the query for lack of 
> stats.
> h4. Revisit Join Cardinality Limit
> The JoinNode has several methods that limit cardinality:
> {code:java}
>   public void computeStats(Analyzer analyzer) {
>     ...
>         cardinality_ = capCardinalityAtLimit(cardinality_);
>     ...
>   }
>   public boolean hasLimit() { return limit_ > -1; }
>   protected long capCardinalityAtLimit(long cardinality) {
>     if (hasLimit()) {
>       return capCardinalityAtLimit(cardinality, limit_);
>     }
>     return cardinality;
>   }
> {code}
> It is not clear when or why we apply a limit. Perhaps as part of {{LIMIT x}} 
> processing? Revisit if the limit is helpful, and remove it if not. Imposing a 
> limit throws off downstream joins, which is probably not what is wanted here.
> # Properly Handle Duplicated Filters in Outer Joins
> Consider the following query on TPC-H which picks 1/3 (newer version) or 1/10 
> (older version) of orders. It then joins them with their customers:
> {code:sql}
> select c.c_custkey, o.o_orderkey
> from tpch.customer c
> left outer join tpch.orders o on c.c_custkey = o.o_custkey
> where o.o_clerk < 'foo'
> {code}
> The plan produced places the WHERE clause predicate on both the scan *and* 
> join nodes:
> {noformat}
> PLAN-ROOT SINK
> |
> 02:HASH JOIN [RIGHT OUTER JOIN]
> |  hash predicates: o.o_custkey = c.c_custkey
> |  other predicates: o.o_clerk < 'foo'           <== Huh?
> |  row-size=51B cardinality=163.35K
> |
> |--00:SCAN HDFS [tpch.customer c]
> |     partitions=1/1 files=1 size=23.08MB row-size=8B cardinality=150.00K
> |
> 01:SCAN HDFS [tpch.orders o]
>    partitions=1/1 files=1 size=162.56MB row-size=43B cardinality=495.00K
>    predicates: o.o_clerk < 'foo'                  <== Obvious location
> {noformat}
> The filter must be applied twice because the outer join will produce null 
> order rows. The filter won't match if the clerk field is null, so the 
> predicate is applied again.
> However, the meaning of the predicate in the join is "remove null rows" which 
> kind of means to undo the "outer". Given that, such a query is rather 
> meaningless.
> The point here is to properly account for the predicate. Simply applying the 
> {{sel(predicate)}} value twice is clearly wrong. We actually want to apply 
> the predicate only to those rows that were generated in the outer join to 
> avoid double-accounting.
> This is an obscure corner-case; the question is whether we need to account 
> for this kind of issue in multiple places. If we do,  we need a more 
> sophisticated set of data models to account for predicates than is currently 
> used.
> See test case in {{card-joins.test}} that references this ticket number for 
> an example.
> h4. Better Handling of OUTER JOIN with Column Filter
> Consider:
> {code:sql}
> SELECT c.c_name, o.o_orderkey
> FROM customers c RIGHT OUTER JOIN orders o
> WHERE c.c_name = 'Bob'
> {code}
> The query runs the "c.c_name = 'Bob'" predicate twice: once in the scan and a 
> second time in the join. Why in the join? To check if the null left 
> (customer) columns match the predicate (which they don't.) The revised code 
> handles this more-or-less correctly using the NDV of "c_name". But, doing so 
> is subtly wrong.
> We want to know the effect of doing the filtering past the join. We know that 
> all the rows that match do have the name "Bob", so the NDV should be 1. The 
> only non-"Bob" rows are the null rows inserted by the join. So, we should use 
> logic that is aware of this case to provide a better estimate.
> On the other hand, If the predicate where "WHERE c.c_name is null", the 
> results would be far different. So, the outer join logic should also be aware 
> of the meaning (null handling) of the predicate. In this case. all the 
> inserted null rows match, so we should *not* use the null count or NDV to 
> guess the filtering.
> Point is, this area is subtle and can't be brute-forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to