[ 
https://issues.apache.org/jira/browse/IMPALA-8042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873325#comment-17873325
 ] 

ASF subversion and git services commented on IMPALA-8042:
---------------------------------------------------------

Commit dcff9871328bcc327041c1e2128e858651266792 in impala's branch 
refs/heads/branch-4.4.1 from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=dcff98713 ]

IMPALA-8042: Assign BETWEEN selectivity for discrete-unique column

The Impala frontend cannot evaluate a BETWEEN/NOT BETWEEN predicate
directly. It must first transform the BetweenPredicate into a
CompoundPredicate consisting of an upper-bound and a lower-bound
BinaryPredicate through BetweenToCompoundRule.java. Each BinaryPredicate
can then be pushed down or rewritten into another form by other
expression rewrite rules. However, the selectivity of the
BetweenPredicate and its derivatives remains unassigned, so they are
often lumped together with other predicates of unknown selectivity and
share a collective selectivity of Expr.DEFAULT_SELECTIVITY (0.1).

This patch adds a narrow optimization of BetweenPredicate selectivity
when the following criteria are met:

1. The BetweenPredicate is bound to a slot reference of a single column
   of a table.
2. The column type is discrete, such as INTEGER or DATE.
3. The column stats are available.
4. The column is sufficiently unique based on available stats.
5. The BETWEEN/NOT BETWEEN predicate is in good form (lower bound value
   <= upper bound value).
6. The final calculated selectivity is less than or equal to
   Expr.DEFAULT_SELECTIVITY.

If any of these criteria is not met, the planner falls back to the old
behavior and leaves the selectivity unassigned.
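
As a rough illustration of how the six criteria could gate the estimate,
here is a minimal Python sketch. It is not the patch's Java code. The
ColumnStats class, the uniqueness_threshold parameter (0.9), and the
formula "values in range / row count" are all assumptions made for
illustration; the commit message does not state the actual formula or
the uniqueness cutoff. Criterion 1 is modeled only by the function
taking the stats of a single column.

```python
from dataclasses import dataclass
from typing import Optional

# Value of Expr.DEFAULT_SELECTIVITY as quoted in the commit message.
DEFAULT_SELECTIVITY = 0.1


@dataclass
class ColumnStats:
    """Hypothetical stand-in for the planner's per-column statistics."""
    num_rows: int      # table row count
    num_distinct: int  # number of distinct values (NDV)
    num_nulls: int     # number of NULL rows in the column
    is_discrete: bool  # True for types such as INTEGER or DATE


def between_selectivity(stats: Optional[ColumnStats], lower: int, upper: int,
                        negated: bool = False,
                        uniqueness_threshold: float = 0.9) -> Optional[float]:
    """Return an estimated selectivity, or None to leave it unassigned."""
    if stats is None or stats.num_rows <= 0:   # criterion 3: stats available
        return None
    if not stats.is_discrete:                  # criterion 2: discrete type
        return None
    non_null = stats.num_rows - stats.num_nulls
    if non_null <= 0:
        return None
    # Criterion 4 (assumed test): NDV close to the non-null row count.
    if stats.num_distinct / non_null < uniqueness_threshold:
        return None
    if lower > upper:                          # criterion 5: good form
        return None
    # Assumed model: a unique column holds at most one row per value.
    rows_in_range = min(upper - lower + 1, non_null)
    matching = non_null - rows_in_range if negated else rows_in_range
    selectivity = matching / stats.num_rows
    if selectivity > DEFAULT_SELECTIVITY:      # criterion 6: cap at default
        return None
    return selectivity
```

Under these assumptions, a fully unique integer column with 150000
non-null rows and the range 1234..2345 covers 1112 values, giving an
estimate of 1112 / 150000 (about 0.0074). A range covering two thirds of
the rows exceeds 0.1 and is left unassigned.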

Since this patch only targets a BetweenPredicate over a unique column,
the following query will still have the default scan selectivity (0.1):

select count(*) from tpch.customer c
where c.c_custkey >= 1234 and c.c_custkey <= 2345;

In contrast, this equivalent query, written with a BETWEEN predicate,
will have a lower scan selectivity:

select count(*) from tpch.customer c
where c.c_custkey between 1234 and 2345;

This patch calculates the BetweenPredicate selectivity during the
transformation in BetweenToCompoundRule.java. The selectivity is carried
on the resulting CompoundPredicate and BinaryPredicates in a
betweenSelectivity_ field, separate from the selectivity_ field.
Analyzer.getBoundPredicates() is modified to prioritize a derived
BinaryPredicate over an ordinary BinaryPredicate in its return value, so
that the derived BinaryPredicate is not eliminated by a matching
ordinary BinaryPredicate.
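
The idea of carrying the estimate through the rewrite, and of preferring
the derived predicates when duplicates are eliminated, can be sketched
with a toy Python model. The class and function names below
(rewrite_between, dedup_prefer_derived) are hypothetical and only mirror
the names in this message; the real BetweenToCompoundRule and
Analyzer.getBoundPredicates() logic is in Java and is not reproduced
here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BinaryPredicate:
    column: str
    op: str      # ">=" or "<="
    value: int
    # Mirrors the betweenSelectivity_ field; None for an ordinary predicate.
    between_selectivity: Optional[float] = None

    def matches(self, other: "BinaryPredicate") -> bool:
        """Same comparison, regardless of where the predicate came from."""
        return (self.column, self.op, self.value) == \
               (other.column, other.op, other.value)


@dataclass(frozen=True)
class CompoundPredicate:
    left: BinaryPredicate
    right: BinaryPredicate
    between_selectivity: Optional[float] = None


def rewrite_between(column: str, lower: int, upper: int,
                    selectivity: Optional[float]) -> CompoundPredicate:
    """Rewrite 'column BETWEEN lower AND upper', keeping the estimate."""
    return CompoundPredicate(
        BinaryPredicate(column, ">=", lower, selectivity),
        BinaryPredicate(column, "<=", upper, selectivity),
        selectivity)


def dedup_prefer_derived(preds: List[BinaryPredicate]) -> List[BinaryPredicate]:
    """Drop matching duplicates, keeping BETWEEN-derived predicates first."""
    # Stable sort: predicates that carry a between selectivity come first.
    ordered = sorted(preds, key=lambda p: p.between_selectivity is None)
    kept: List[BinaryPredicate] = []
    for pred in ordered:
        if not any(pred.matches(k) for k in kept):
            kept.append(pred)
    return kept
```

In this model, if an ordinary "c_custkey >= 1234" predicate appears
alongside the pair derived from "c_custkey BETWEEN 1234 AND 2345", the
derived copy survives deduplication and the attached estimate is not
lost.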

Testing:
- Add table functional_parquet.unique_with_nulls.
- Add FE tests in ExprCardinalityTest#testBetweenSelectivity,
  ExprCardinalityTest#testNotBetweenSelectivity, and
  PlannerTest#testScanCardinality.
- Pass core tests.

Change-Id: Ib349d97349d1ee99788645a66be1b81749684d10
Reviewed-on: http://gerrit.cloudera.org:8080/21377
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Better selectivity estimate for BETWEEN
> ---------------------------------------
>
>                 Key: IMPALA-8042
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8042
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Assignee: Riza Suminto
>            Priority: Minor
>             Fix For: Impala 4.5.0
>
>
> The analyzer rewrites a BETWEEN expression into a pair of inequalities.  
> IMPALA-8037 explains that the planner then groups all such non-equality 
> conditions together and assigns a selectivity of 0.1. IMPALA-8031 explains 
> that the analyzer should handle inequalities better.
> BETWEEN is a special case and informs the final result. If we assume a 
> selectivity of s for inequality, then BETWEEN should be something like s/2. 
> The intuition is that if c >= x includes, say, ⅓ of values, and c <= y 
> includes a third of values, then c BETWEEN x AND y should be a narrower set 
> of values, say ⅙.
> [Ramakrishnan and Gehrke|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> recommend 0.4 for between, 0.3 for inequality, and 0.3^2 = 0.09 for the 
> general expression x <= c AND c <= y. Note the discrepancy between the 
> compound inequality case and the BETWEEN case, likely reflecting the 
> additional information we obtain when the user chooses to use BETWEEN.
> To implement a special BETWEEN selectivity in Impala, we must remember the 
> selectivity of BETWEEN during the rewrite to a compound inequality.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
