[jira] [Updated] (IMPALA-2416) Use Min, Max, Distinct count & row count to create a uniformly distributed histogram for better Cardinality estimation

Shant Hovsepian (Jira) Sat, 18 Jul 2020 13:31:02 -0700


     [ https://issues.apache.org/jira/browse/IMPALA-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Shant Hovsepian updated IMPALA-2416:
------------------------------------
    Description: 
As a stepping stone to using histograms for more accurate cardinality 
estimation, build a uniformly distributed histogram using Min, Max, Distinct 
count & row count for better estimation of joins and filters.

For a table with the following stats, this is what Impala estimates:
{code}
+---------+--------+---------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------+
| #Rows   | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                  |
+---------+--------+---------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------+
| 1500000 | 2      | 54.93MB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/tpch.orders_parquet |
+---------+--------+---------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------+
{code}
{code}
+-----------------+---------------+------------------+--------+----------+-------------------+
| Column          | Type          | #Distinct Values | #Nulls | Max Size | Avg Size          |
+-----------------+---------------+------------------+--------+----------+-------------------+
| o_orderkey      | BIGINT        | 1563438          | -1     | 8        | 8                 |
| o_custkey       | BIGINT        | 98390            | -1     | 8        | 8                 |
| o_orderstatus   | STRING        | 3                | -1     | 1        | 1                 |
| o_totalprice    | DECIMAL(12,2) | 1438190          | -1     | 8        | 8                 |
| o_orderdate     | STRING        | 2468             | -1     | 10       | 10                |
| o_orderpriority | STRING        | 5                | -1     | 15       | 8.399886131286621 |
| o_clerk         | STRING        | 1006             | -1     | 15       | 15                |
| o_shippriority  | INT           | 1                | -1     | 4        | 4                 |
| o_comment       | STRING        | 1388613          | -1     | 78       | 48.51387023925781 |
+-----------------+---------------+------------------+--------+----------+-------------------+
{code}

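{code}
+-------------------------------------------+----------+---------+
| Condition                                 | Estimate | Actual  |
+-------------------------------------------+----------+---------+
| o_orderkey in (1,2,3,4)                   | 4        | 4       |
| o_orderkey between 1 and 4                | 15,000   | 4       |
| o_orderkey <= 4 and o_orderkey >= 1       | 15,000   | 4       |
| o_orderkey <= 1500000 and o_orderkey >= 1 | 15,000   | 375,000 |
+-------------------------------------------+----------+---------+
{code}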
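
The estimates above give 15,000 rows (1% of the 1,500,000 rows) for every range predicate, whatever its bounds. As a rough illustration of the proposal, here is a minimal Python sketch of a uniform-distribution estimate. It is not Impala code. The names ColumnStats, estimate_range_rows and estimate_in_list_rows are hypothetical. The stats shown in this issue do not include min/max for o_orderkey, so the sketch assumes 1 and 6,000,000, the usual range for TPC-H at scale factor 1.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical container for the four statistics named in the issue."""
    row_count: int
    ndv: int          # number of distinct values (may be an over-estimate)
    min_value: int    # assumed to be available; not shown in the issue's stats
    max_value: int    # assumed to be available; not shown in the issue's stats


def estimate_range_rows(stats: ColumnStats, low: int, high: int) -> float:
    """Estimate rows matching low <= col <= high for an integer column,
    assuming values are spread uniformly over [min_value, max_value]."""
    # Clamp the predicate range to the column's known domain.
    lo = max(low, stats.min_value)
    hi = min(high, stats.max_value)
    if lo > hi:
        return 0.0
    domain_width = stats.max_value - stats.min_value + 1
    # Fraction of the domain covered by the predicate, scaled to the row count.
    return stats.row_count * (hi - lo + 1) / domain_width


def estimate_in_list_rows(stats: ColumnStats, values: Iterable[int]) -> float:
    """Estimate rows matching col IN (values), assuming every distinct value
    occurs equally often."""
    # An NDV estimate can exceed the row count; cap it so rows-per-value >= 1.
    capped_ndv = max(1, min(stats.ndv, stats.row_count))
    rows_per_value = stats.row_count / capped_ndv
    # Ignore duplicates and values outside the known domain.
    in_domain = {v for v in values if stats.min_value <= v <= stats.max_value}
    return min(float(stats.row_count), len(in_domain) * rows_per_value)


# o_orderkey from the stats above; min/max are assumptions (TPC-H SF1 range).
ORDERKEY_STATS = ColumnStats(
    row_count=1_500_000, ndv=1_563_438, min_value=1, max_value=6_000_000
)
```

Under those assumed bounds, estimate_range_rows(ORDERKEY_STATS, 1, 1500000) returns 375,000, which matches the Actual column, and estimate_range_rows(ORDERKEY_STATS, 1, 4) returns 1 rather than 15,000. Real data is rarely perfectly uniform, so this is only a coarse first step before true histograms.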


> Use Min, Max, Distinct count & row count to create a uniformly distributed 
> histogram for better Cardinality estimation
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-2416
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2416
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Frontend
>    Affects Versions: Impala 2.3.0
>            Reporter: Mostafa Mokhtar
>            Priority: Minor
>              Labels: performance



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
