Shohei Okumiya created HIVE-28363:
-------------------------------------

             Summary: Improve heuristics of FilterStatsRule without column stats
                 Key: HIVE-28363
                 URL: https://issues.apache.org/jira/browse/HIVE-28363
             Project: Hive
          Issue Type: Improvement
          Components: Statistics
    Affects Versions: 4.0.0
            Reporter: Shohei Okumiya
            Assignee: Shohei Okumiya


HIVE-19647 enables Hive to estimate the selectivity of IN operators using new 
column stats. It also changed the heuristics when column stats were 
unavailable. In my observation, the original one would give a better estimation 
in that case.

 

This is an example. The table has ten rows and no column stats on `id`.

 
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> DESCRIBE FORMATTED users id;
...
+------------------------+-----------------------------+
|    column_property     |            value            |
+------------------------+-----------------------------+
| col_name               | id                          |
| data_type              | int                         |
| min                    |                             |
| max                    |                             |
| num_nulls              |                             |
| distinct_count         |                             |
| avg_col_len            |                             |
| max_col_len            |                             |
| num_trues              |                             |
| num_falses             |                             |
| bit_vector             |                             |
| comment                | from deserializer           |
| COLUMN_STATS_ACCURATE  | {\"BASIC_STATS\":\"true\"}  |
+------------------------+-----------------------------+{code}
With a single needle, the estimated number becomes 10 * 0.5 = 5 because of the 
fallback heuristics.

 

 
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
WHERE id IN (1);
...
|                 TableScan                          |
|                   alias: users                     |
|                   filterExpr: (id = 1) (type: boolean) |
|                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE |
|                   Filter Operator                  |
|                     predicate: (id = 1) (type: boolean) |
|                     Statistics: Num rows: 5 Data size: 5 Basic stats: 
COMPLETE Column stats: NONE | {code}
The size is estimated to be the original size with two or more needles. That's 
because the heuristics estimate the size as min(10, 10 * 0.5 * N) = 10. But I 
believe users will observe some reduction when they use IN.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
WHERE id IN (1, 2);
|                 TableScan                          |
|                   alias: users                     |
|                   filterExpr: (id) IN (1, 2) (type: boolean) |
|                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE |
|                   Filter Operator                  |
|                     predicate: (id) IN (1, 2) (type: boolean) |
|                     Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE | {code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to