Paul Rogers created IMPALA-8032:
-----------------------------------

             Summary: Gather minimum, maximum values to better estimate 
inequality selectivity
                 Key: IMPALA-8032
                 URL: https://issues.apache.org/jira/browse/IMPALA-8032
             Project: IMPALA
          Issue Type: Improvement
          Components: Catalog
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers


A query may contain an inequality predicate. TPC-H has many, such as 
{{l_shipdate <= '1998-09-02'}}.

The planner must know the selectivity of each predicate applied to filter a 
table. Inequalities are impossible to estimate from just the NDV (number of 
distinct values) available in the catalog. As a result, most systems assume 
some value around 0.3 or 0.4. (Textbooks recommend 0.3.)

The query literature notes that the best way to estimate an inequality is with 
histograms. The literature also describes a cheaper alternative:

* Assume uniform value distribution, and
* Gather the minimum and maximum column values.

Given these, an inequality is easy to estimate as:

{noformat}
sel(c < x) = (x - min(c)) / (max(c) - min(c))

sel(c > x) = (max(c) - x) / (max(c) - min(c))
{noformat}

The cost is just two extra values per column rather than the full cost of a 
histogram.
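
The two formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Impala's planner code: the function names, the clamping of out-of-range literals to [0, 1], the handling of a single-valued column, and the fallback to 0.3 when statistics are missing are all hypothetical choices made for the example. The min/max dates in the usage lines are illustrative values, not statistics read from a catalog.

```python
from datetime import date

# Fallback guess when min/max are unavailable (assumption: the 0.3 cited above).
DEFAULT_SELECTIVITY = 0.3


def _clamp(value):
    """Keep an estimate inside the valid selectivity range [0, 1]."""
    return max(0.0, min(1.0, value))


def sel_less_than(x, col_min, col_max):
    """Estimate sel(c < x) = (x - min(c)) / (max(c) - min(c)), assuming uniform values."""
    if col_min is None or col_max is None:
        return DEFAULT_SELECTIVITY
    width = col_max - col_min
    if not width:
        # Single-valued column: every row matches, or none does.
        return 1.0 if x > col_min else 0.0
    return _clamp((x - col_min) / width)


def sel_greater_than(x, col_min, col_max):
    """Estimate sel(c > x) = (max(c) - x) / (max(c) - min(c)), assuming uniform values."""
    if col_min is None or col_max is None:
        return DEFAULT_SELECTIVITY
    width = col_max - col_min
    if not width:
        return 1.0 if x < col_min else 0.0
    return _clamp((col_max - x) / width)


# Works for any type that supports subtraction and division of differences,
# such as numbers or dates (date - date yields a timedelta).
ship_min, ship_max = date(1992, 1, 2), date(1998, 12, 1)  # illustrative values
print(sel_less_than(date(1998, 9, 2), ship_min, ship_max))
print(sel_less_than(25, 0, 100))
```

Under the uniform (continuous) assumption, {{<}} and {{<=}} get the same estimate, so the sketch does not distinguish them.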


