Paul Rogers created IMPALA-8032:
-----------------------------------

             Summary: Gather minimum, maximum values to better estimate inequality selectivity
                 Key: IMPALA-8032
                 URL: https://issues.apache.org/jira/browse/IMPALA-8032
             Project: IMPALA
          Issue Type: Improvement
          Components: Catalog
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers

A query may contain an inequality predicate; TPC-H has many, such as {{l_shipdate <= '1998-09-02'}}. The planner must know the selectivity of each predicate applied to filter a table. Inequality selectivity cannot be estimated from just the NDV (number of distinct values) available in the catalog. As a result, most systems assume a fixed value of around 0.3 or 0.4 (textbooks recommend 0.3).

The query literature notes that the best way to estimate an inequality is with histograms. The literature also describes a cheaper alternative:

* Assume a uniform value distribution, and
* Gather the minimum and maximum column values.

Given these, it is easy to estimate an inequality as:

{noformat}
sel(c < x) = (x - min(c)) / (max(c) - min(c))
sel(c > x) = (max(c) - x) / (max(c) - min(c))
{noformat}

The cost is just two extra values per column rather than the full cost of a histogram.
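
To make the min/max estimate concrete, here is a minimal Python sketch of the two formulas. It is an illustration only, not Impala planner code: the function name {{estimate_selectivity}}, the 0.3 fallback constant, the clamping of results to the range 0 to 1, the treatment of {{<=}} like {{<}} (and {{>=}} like {{>}}), and the conversion of dates to day ordinals are all assumptions made for this example.

```python
from datetime import date

# Assumed fallback: the fixed guess used when min/max stats are missing.
DEFAULT_SELECTIVITY = 0.3


def _to_number(value):
    """Map a date to its day ordinal so dates and numbers share one code path."""
    return value.toordinal() if isinstance(value, date) else value


def estimate_selectivity(op, x, col_min=None, col_max=None):
    """Estimate sel(c op x) under a uniform-distribution assumption.

    op is one of '<', '<=', '>', '>='. The inclusive and exclusive forms
    are treated the same, which is an approximation.
    """
    if op not in ("<", "<=", ">", ">="):
        raise ValueError(f"unsupported operator: {op!r}")
    if col_min is None or col_max is None:
        return DEFAULT_SELECTIVITY

    lo, hi, v = _to_number(col_min), _to_number(col_max), _to_number(x)
    if hi <= lo:
        # Single-valued column or inconsistent stats: the formula is undefined.
        return DEFAULT_SELECTIVITY

    if op in ("<", "<="):
        fraction = (v - lo) / (hi - lo)   # sel(c < x)
    else:
        fraction = (hi - v) / (hi - lo)   # sel(c > x)

    # A literal outside [min, max] selects all rows or none, so clamp.
    return min(1.0, max(0.0, fraction))


if __name__ == "__main__":
    # Numeric column with assumed stats min=0, max=100.
    print(estimate_selectivity("<", 25, 0, 100))
    # Date column; the min/max values here are assumed for illustration.
    print(estimate_selectivity("<=", date(1998, 9, 2),
                               date(1992, 1, 2), date(1998, 12, 1)))
```

With the assumed date range, the {{l_shipdate <= '1998-09-02'}} predicate comes out close to 1, far from the fixed 0.3 guess, which is the kind of difference the two extra statistics are meant to capture.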