[ https://issues.apache.org/jira/browse/IMPALA-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shant Hovsepian updated IMPALA-2416:
------------------------------------
Description:
As a stepping stone to using histograms for more accurate cardinality
estimation, build a uniformly distributed histogram using Min, Max, Distinct
count & row count for better estimation of joins and filters.
For a table with the following stats, this is what Impala estimates:
{code}
+---------+--------+---------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------+
| #Rows   | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                                                  |
+---------+--------+---------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------+
| 1500000 | 2      | 54.93MB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://localhost:20500/test-warehouse/tpch.orders_parquet |
+---------+--------+---------+--------------+-------------------+---------+-------------------+-----------------------------------------------------------+
{code}
{code}
+-----------------+---------------+------------------+--------+----------+-------------------+
| Column          | Type          | #Distinct Values | #Nulls | Max Size | Avg Size          |
+-----------------+---------------+------------------+--------+----------+-------------------+
| o_orderkey      | BIGINT        | 1563438          | -1     | 8        | 8                 |
| o_custkey       | BIGINT        | 98390            | -1     | 8        | 8                 |
| o_orderstatus   | STRING        | 3                | -1     | 1        | 1                 |
| o_totalprice    | DECIMAL(12,2) | 1438190          | -1     | 8        | 8                 |
| o_orderdate     | STRING        | 2468             | -1     | 10       | 10                |
| o_orderpriority | STRING        | 5                | -1     | 15       | 8.399886131286621 |
| o_clerk         | STRING        | 1006             | -1     | 15       | 15                |
| o_shippriority  | INT           | 1                | -1     | 4        | 4                 |
| o_comment       | STRING        | 1388613          | -1     | 78       | 48.51387023925781 |
+-----------------+---------------+------------------+--------+----------+-------------------+
{code}
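{code}
+-------------------------------------------+----------+---------+
| Condition                                 | Estimate | Actual  |
+-------------------------------------------+----------+---------+
| o_orderkey in (1,2,3,4)                   | 4        | 4       |
| o_orderkey between 1 and 4                | 15,000   | 4       |
| o_orderkey <= 4 and o_orderkey >= 1       | 15,000   | 4       |
| o_orderkey <= 1500000 and o_orderkey >= 1 | 15,000   | 375,000 |
+-------------------------------------------+----------+---------+
{code}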
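To illustrate the request, here is a minimal Python sketch of how a uniform-distribution estimate could be derived from Min, Max, Distinct count and row count. It is not Impala's planner code. The class and function names are hypothetical, and the o_orderkey range of 1 to 6000000 is an assumption based on TPC-H scale factor 1, since the stats above do not include min or max values. The row count and distinct count are taken from the stats above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Per-column statistics. min_value and max_value are hypothetical inputs
    that the stats shown above do not currently expose."""
    row_count: int
    ndv: int
    min_value: int
    max_value: int


def estimate_in_list(stats: ColumnStats, num_values: int) -> float:
    """Rows matching `col IN (...)`, assuming every distinct value is equally frequent."""
    rows_per_value = stats.row_count / stats.ndv
    return min(stats.row_count, num_values * rows_per_value)


def estimate_range(stats: ColumnStats, lo: int, hi: int) -> float:
    """Rows matching `lo <= col <= hi` on an integer column, assuming values
    are spread uniformly over [min_value, max_value]."""
    # Clamp the predicate range to the known column range.
    lo = max(lo, stats.min_value)
    hi = min(hi, stats.max_value)
    if lo > hi:
        return 0.0
    domain = stats.max_value - stats.min_value + 1
    fraction = (hi - lo + 1) / domain
    return fraction * stats.row_count


# row_count and ndv come from the stats above; the 1..6000000 range is an assumption.
orders_orderkey = ColumnStats(
    row_count=1_500_000, ndv=1_563_438, min_value=1, max_value=6_000_000
)

print(round(estimate_in_list(orders_orderkey, 4)))          # IN (1,2,3,4)
print(round(estimate_range(orders_orderkey, 1, 4)))         # BETWEEN 1 AND 4
print(round(estimate_range(orders_orderkey, 1, 1_500_000))) # 1 <= key <= 1500000
```

Under the assumed key range, the range estimates scale with the width of the predicate instead of staying at a fixed 15,000 rows for every range condition.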
> Use Min, Max, Distinct count & row count to create a uniformly distributed
> histogram for better Cardinality estimation
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: IMPALA-2416
> URL: https://issues.apache.org/jira/browse/IMPALA-2416
> Project: IMPALA
> Issue Type: New Feature
> Components: Frontend
> Affects Versions: Impala 2.3.0
> Reporter: Mostafa Mokhtar
> Priority: Minor
> Labels: performance