[
https://issues.apache.org/jira/browse/IMPALA-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers reassigned IMPALA-8045:
-----------------------------------
Assignee: (was: Paul Rogers)
> Rollup of Smaller Join Cardinality Issues
> -----------------------------------------
>
> Key: IMPALA-8045
> URL: https://issues.apache.org/jira/browse/IMPALA-8045
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Priority: Major
>
> The work to review join cardinality has found some major issues recorded as
> JIRA tickets. This ticket records a number of smaller issues. Some of these
> issues are a bit tricky because they appear only when some of the other
> issues are resolved. Reporting them directly could be misleading.
> h4. ScanNode confusion between table and scan input cardinality
> The {{ScanNode}} class in the scanner contains an {{inputCardinality_}} field
> used by join calculations as a proxy for the table size. However, the actual
> scan node implementations set the {{inputCardinality_}} to the estimated
> number of rows *read* by the scan, which is useful when understanding the
> physical scan structure. But, for joins, we need the base table cardinality.
> For example, the join may use the input cardinality to understand the
> reduction in rows due to filters in order to adjust the NDV of key columns.
> But, since the input cardinality is the scan count, not the table row count,
> the math does not work out.
> The solution is to clarify the code to separate the idea of scan count vs.
> base table row count.
> h4. Selectivity Confusion
> Similarly, each node computes its selectivity. However, the selectivity is
> only for those predicates that will be applied via a projection. Predicates
> that can be applied because of partition pruning (HDFS), key range pruning
> (HBase) and so on do not "count". While this produces accurate execution
> estimates, it is not helpful for join planning.
> In join planning, we need to know the number of filtered rows relative to the
> total table cardinality. This allows us to adjust HDV key cardinality in
> order to estimate the number of rows produced by the join.
> Using the partial selectivity, or partial input cardinality (above issue)
> causes inaccurate key cardinality adjustments and incorrect join cardinality
> estimates.
> h4. Join Node Does not Apply Selectivity from Its Predicates
> A join node can have "additional predicates" applied after creating a join
> row. Accurate estimation of join cardinality must include the selectivity
> from those predicates, but is not currently done. Perhaps because such
> predicates, in the current estimation scheme, always produce an estimated
> selectivity of .1. This will be more important as we add more realistic
> estimates.
> h4. Use Double, not Long for Cardinality Values
> In scan nodes, row counts can be reasonable numbers and a Java {{long}} is
> fine. But, once one starts computing join cardinalities, values can grow
> fast, especially for cross joins. The code currently has special checks to
> limit products to {{Long.MAX_VALUE}}. While this solves the overflow issue,
> it has undesirable downstream affects. First, it throws of selectivity
> calculations since the reported cardinality is not the real cardinality.
> Second, it requires special math calls whenever we multiply cardinalities.
> Much simper to work with a {{double}}. When values get large, the extra
> precision from a integer value is completely lost in the noise of assumptions
> and estimations.
> h4. Revisit Cardinality Calcs for Join Nodes
> The method {{JoinNode.computeStats()}} is a bit muddled. It starts by
> computing cardinality depending on the major type family (semi-join,
> inner/outer join, cross join). It then revises those calcs based on the
> specific join type. This makes it very hard to follow the logic case we have
> to follow two distinct blocks of code. There is also redundancy. The cross
> join cardinality is calculated twice, for example.
> Refactor to have a cardinality/selectivity calculation per join type.
> h4. Disallow Unknown Cardinality
> Multiple nodes can produce a cardinality of -1 (unknown). Since it is
> impossible to plan based on an unknown cardinality, we must have an estimate,
> however good or bad. For cases where we have no stats, estimate cardinality
> based on other factors. If we have no column NDV, perhaps guesstimate
> something, or use an alternative join calculation that avoids the need for
> NDV (while producing much cruder estimates.) However, refusing to play the
> game at all is not helpful unless we choose to fail the query for lack of
> stats.
> h4. Revisit Join Cardinality Limit
> The JoinNode has several methods that limit cardinality:
> {code:java}
> public void computeStats(Analyzer analyzer) {
> ...
> cardinality_ = capCardinalityAtLimit(cardinality_);
> ...
> }
> public boolean hasLimit() { return limit_ > -1; }
> protected long capCardinalityAtLimit(long cardinality) {
> if (hasLimit()) {
> return capCardinalityAtLimit(cardinality, limit_);
> }
> return cardinality;
> }
> {code}
> It is not clear when or why we apply a limit. Perhaps as part of {{LIMIT x}}
> processing? Revisit if the limit is helpful, and remove it if not. Imposing a
> limit throws off downstream joins, which is probably not what is wanted here.
> # Properly Handle Duplicated Filters in Outer Joins
> Consider the following query on TPC-H which picks 1/3 (newer version) or 1/10
> (older version) of orders. It then joins them with their customers:
> {code:sql}
> select c.c_custkey, o.o_orderkey
> from tpch.customer c
> left outer join tpch.orders o on c.c_custkey = o.o_custkey
> where o.o_clerk < 'foo'
> {code}
> The plan produced places the WHERE clause predicate on both the scan *and*
> join nodes:
> {noformat}
> PLAN-ROOT SINK
> |
> 02:HASH JOIN [RIGHT OUTER JOIN]
> | hash predicates: o.o_custkey = c.c_custkey
> | other predicates: o.o_clerk < 'foo' <== Huh?
> | row-size=51B cardinality=163.35K
> |
> |--00:SCAN HDFS [tpch.customer c]
> | partitions=1/1 files=1 size=23.08MB row-size=8B cardinality=150.00K
> |
> 01:SCAN HDFS [tpch.orders o]
> partitions=1/1 files=1 size=162.56MB row-size=43B cardinality=495.00K
> predicates: o.o_clerk < 'foo' <== Obvious location
> {noformat}
> The filter must be applied twice because the outer join will produce null
> order rows. The filter won't match if the clerk field is null, so the
> predicate is applied again.
> However, the meaning of the predicate in the join is "remove null rows" which
> kind of means to undo the "outer". Given that, such a query is rather
> meaningless.
> The point here is to properly account for the predicate. Simply applying the
> {{sel(predicate)}} value twice is clearly wrong. We actually want to apply
> the predicate only to those rows that were generated in the outer join to
> avoid double-accounting.
> This is an obscure corner-case; the question is whether we need to account
> for this kind of issue in multiple places. If we do, we need a more
> sophisticated set of data models to account for predicates than is currently
> used.
> See test case in {{card-joins.test}} that references this ticket number for
> an example.
> h4. Better Handling of OUTER JOIN with Column Filter
> Consider:
> {code:sql}
> SELECT c.c_name, o.o_orderkey
> FROM customers c RIGHT OUTER JOIN orders o
> WHERE c.c_name = 'Bob'
> {code}
> The query runs the "c.c_name = 'Bob'" predicate twice: once in the scan and a
> second time in the join. Why in the join? To check if the null left
> (customer) columns match the predicate (which they don't.) The revised code
> handles this more-or-less correctly using the NDV of "c_name". But, doing so
> is subtly wrong.
> We want to know the effect of doing the filtering past the join. We know that
> all the rows that match do have the name "Bob", so the NDV should be 1. The
> only non-"Bob" rows are the null rows inserted by the join. So, we should use
> logic that is aware of this case to provide a better estimate.
> On the other hand, If the predicate where "WHERE c.c_name is null", the
> results would be far different. So, the outer join logic should also be aware
> of the meaning (null handling) of the predicate. In this case. all the
> inserted null rows match, so we should *not* use the null count or NDV to
> guess the filtering.
> Point is, this area is subtle and can't be brute-forced.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]