[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

Vineet Garg (JIRA) Sun, 14 Oct 2018 11:30:10 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-20660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vineet Garg updated HIVE-20660:
-------------------------------
    Status: Patch Available  (was: Open)

> Group by statistics estimation could be improved by bounding the total number 
> of rows to source table
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-20660
>                 URL: https://issues.apache.org/jira/browse/HIVE-20660
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 4.0.0
>            Reporter: Vineet Garg
>            Assignee: Vineet Garg
>            Priority: Major
>         Attachments: HIVE-20660.1.patch, HIVE-20660.2.patch, 
> HIVE-20660.3.patch, HIVE-20660.4.patch, HIVE-20660.5.patch, 
> HIVE-20660.6.patch, HIVE-20660.7.patch
>
>
> Currently the stats for group by is estimated by taking product of NDVs of 
> all the keys and bounding it by the number of rows of its input. This bound 
> could be improved by using the source table instead of immediate input, the 
> insight in this case is that cardinality/ndvs of a table can not go beyond 
> the original (outer joins will only add NULLs thereby increasing the 
> cardinality by 1).
> Note that the assumption here is that group by keys all belong to the same 
> source table/input.
> This will improve the estimation in situations where group by is executed 
> after joins wherein Hive could end up estimating the number of rows.
> *Reproducer*
> {code:sql}
> set hive.stats.fetch.column.stats=true;
> create table t1(i int, j int);
> alter table t1 update statistics set('numRows'='10000', 
> 'rawDataSize'='18000');
> alter table t1 update statistics for column i 
> set('numDVs'='2500','numNulls'='50','highValue'='1000','lowValue'='0');
> alter table t1 update statistics for column j 
> set('numDVs'='500','numNulls'='30','highValue'='100','lowValue'='50');
> create table t2(i2 int, j2 int);
> alter table t2 update statistics set('numRows'='100000000', 
> 'rawDataSize'='10000');
> alter table t2 update statistics for column i2 
> set('numDVs'='10000000','numNulls'='0','highValue'='8000','lowValue'='0');
> alter table t2 update statistics for column j2 
> set('numDVs'='10','numNulls'='0','highValue'='800','lowValue'='-1');
> explain select count (1) from t1,t2
>       where t1.j=t2.i2 
> group by t1.i, t1.j;
> {code}
> {code:sql}
> Reducer 2
>             Reduce Operator Tree:
>               Merge Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                 keys:
>                   0 _col1 (type: int)
>                   1 _col0 (type: int)
>                 outputColumnNames: _col0, _col1
>                 Statistics: Num rows: 99700 Data size: 797288 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                 Group By Operator
>                   aggregations: count()
>                   keys: _col0 (type: int), _col1 (type: int)
>                   mode: hash
>                   outputColumnNames: _col0, _col1, _col2
>                   Statistics: Num rows: 49850 Data size: 797448 Basic stats: 
> COMPLETE Column stats: COMPLETE <==========
>                   Reduce Output Operator
>                     key expressions: _col0 (type: int), _col1 (type: int)
>                     sort order: ++
>                     Map-reduce partition columns: _col0 (type: int), _col1 
> (type: int)
>                     Statistics: Num rows: 49850 Data size: 797448 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                     value expressions: _col2 (type: bigint)
>   .....................
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HIVE-20660) Group by statistics estimation could be improved by bounding the total number of rows to source table

Reply via email to