bharath v created IMPALA-7560:
---------------------------------
Summary: Better selectivity estimate for != (not equals) binary
predicate
Key: IMPALA-7560
URL: https://issues.apache.org/jira/browse/IMPALA-7560
Project: IMPALA
Issue Type: Bug
Components: Frontend
Affects Versions: Impala 2.12.0, Impala 2.10.0, Impala 2.9.0, Impala 2.8.0,
Impala 2.13.0
Reporter: bharath v
Currently we use the default selectivity estimate for any binary predicate whose
operator is anything other than EQ / NOT_DISTINCT.
{noformat}
// Determine selectivity
// TODO: Compute selectivity for nested predicates.
// TODO: Improve estimation using histograms.
Reference<SlotRef> slotRefRef = new Reference<SlotRef>();
if ((op_ == Operator.EQ || op_ == Operator.NOT_DISTINCT)
    && isSingleColumnPredicate(slotRefRef, null)) {
  long distinctValues = slotRefRef.getRef().getNumDistinctValues();
  if (distinctValues > 0) {
    selectivity_ = 1.0 / distinctValues;
    selectivity_ = Math.max(0, Math.min(1, selectivity_));
  }
}
{noformat}
This can badly underestimate cardinality. For example, the scan below returns 20
rows, but the planner estimates only 3:
{noformat}
[localhost:21000] tpch> select * from nation where n_regionkey != 1;
[localhost:21000] tpch> summary;
+--------------+--------+----------+----------+---------+--------------+-----------+---------------+-------------+
| Operator     | #Hosts | Avg Time | Max Time | *#Rows* | *Est. #Rows* | Peak Mem  | Est. Peak Mem | Detail      |
+--------------+--------+----------+----------+---------+--------------+-----------+---------------+-------------+
| 00:SCAN HDFS | 1      | 3.32ms   | 3.32ms   | *20*    | *3*          | 143.00 KB | 16.00 MB      | tpch.nation |
+--------------+--------+----------+----------+---------+--------------+-----------+---------------+-------------+
[localhost:21000] tpch>
{noformat}
Ideally, for != we would invert the equality selectivity and use 4/5 (= 1 - 1/5),
which gives a much better estimate.
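
A rough, standalone sketch of the idea follows. It is not a patch against the Impala frontend: the class, the `Op` enum, and both helper methods are hypothetical names made up for illustration. It also assumes a default selectivity of 0.1 and round-to-nearest row estimates, which is consistent with the estimate of 3 rows above for the 25-row TPC-H nation table but is not taken from the planner code.

```java
// Hypothetical, self-contained illustration; not Impala code.
final class NeSelectivitySketch {
    // Assumed fallback selectivity (consistent with 3 estimated rows out of 25).
    static final double DEFAULT_SELECTIVITY = 0.1;

    // Simplified stand-in for the planner's binary operators.
    enum Op { EQ, NOT_DISTINCT, NE, LT }

    // Estimates the selectivity of a single-column predicate "col <op> literal".
    static double estimateSelectivity(Op op, long distinctValues) {
        if (distinctValues <= 0) return DEFAULT_SELECTIVITY; // no column stats
        double eq = Math.max(0.0, Math.min(1.0, 1.0 / distinctValues));
        switch (op) {
            case EQ:
            case NOT_DISTINCT:
                return eq;           // existing behaviour: 1 / NDV
            case NE:
                return 1.0 - eq;     // proposed: inverse of the equality estimate
            default:
                return DEFAULT_SELECTIVITY;
        }
    }

    // Assumed rounding rule for turning a selectivity into a row estimate.
    static long estimateRows(long tableRows, double selectivity) {
        return Math.round(tableRows * selectivity);
    }

    public static void main(String[] args) {
        long nationRows = 25; // TPC-H nation table
        long regionNdv = 5;   // distinct values of n_regionkey
        // Current behaviour for !=: default selectivity.
        System.out.println(estimateRows(nationRows, DEFAULT_SELECTIVITY)); // prints 3
        // Proposed behaviour for !=: 1 - 1/NDV.
        System.out.println(
            estimateRows(nationRows, estimateSelectivity(Op.NE, regionNdv))); // prints 20
    }
}
```

With 1 - 1/NDV the estimate for the query above matches the actual 20 rows. Note that this simple inversion ignores NULLs in the column, which never satisfy !=, so a real change would probably need to account for the null fraction as well.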