[ https://issues.apache.org/jira/browse/IMPALA-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733736#comment-16733736 ]
Paul Rogers commented on IMPALA-8024:
-------------------------------------
Another example, {{functional_hbase.stringids}}:
{noformat}
Query: show table stats stringids
+-----------------+--------------+------------+--------+
| Region Location | Start RowKey | Est. #Rows | Size |
+-----------------+--------------+------------+--------+
| localhost | | 10 | 0B |
| localhost | 1 | 4295 | 1.00MB |
| localhost | 3 | 4267 | 1.00MB |
| localhost | 5 | 4292 | 1.00MB |
| localhost | 7 | 4290 | 1.00MB |
| localhost | 9 | 10 | 0B |
| Total | | 17164 | 4.00MB |
+-----------------+--------------+------------+--------+
Query: show column stats stringids
+-----------------+-----------+------------------+--------+----------+-------------------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+-----------------+-----------+------------------+--------+----------+-------------------+
| id | STRING | 10000 | 0 | 4 | 3.888999938964844 |
...
select count(*) from stringids
+----------+
| count(*) |
+----------+
| 10000 |
+----------+
{noformat}
Here, {{id}} is unique, so its NDV (10,000) reflects the row count at the time
stats were gathered. The table stats, however, estimate about 17K rows (17,164),
while the actual row count is 10K, the same as the NDV in the column stats.
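
To make the arithmetic concrete, here is a minimal Python sketch that uses only the figures printed above. The helper {{ndv_lower_bound}} is a hypothetical name for illustration and is not part of Impala. It encodes one assumption: a table cannot have fewer rows than the largest NDV among its columns. NDV values may themselves be approximate, so treat the result as a sanity check rather than an exact bound.

```python
# Figures copied from the SHOW TABLE STATS / SHOW COLUMN STATS output above.
region_row_estimates = [10, 4295, 4267, 4292, 4290, 10]  # "Est. #Rows" per region
actual_rows = 10000                                       # from select count(*)


def ndv_lower_bound(estimated_rows, column_ndvs):
    """Hypothetical helper (not an Impala API): raise a row estimate to at
    least the largest column NDV, since a table cannot have fewer rows than
    distinct values in any one column."""
    return max([estimated_rows, *column_ndvs])


estimated_rows = sum(region_row_estimates)
overestimate = estimated_rows / actual_rows

print(estimated_rows)           # 17164, matching the "Total" row
print(round(overestimate, 2))   # 1.72, i.e. about 72% too high

# For functional_hbase.alltypessmall (quoted below), the region estimates
# sum to 60 while the column NDVs are 99 and 100.
print(ndv_lower_bound(60, [99, 100]))   # 100
```

Note that such a lower bound only helps when the region-based estimate is too low, as with {{alltypessmall}}. It does nothing for {{stringids}}, where the estimate is too high.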
> HBase table cardinality estimates are wrong
> -------------------------------------------
>
> Key: IMPALA-8024
> URL: https://issues.apache.org/jira/browse/IMPALA-8024
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.1.0
> Reporter: Paul Rogers
> Priority: Major
>
> IMPALA-8021 added cardinality estimates to EXPLAIN plan output. Running some
> of our {{PlannerTest}} files revealed that our HBase cardinality estimates
> are very poor, even for our simple test tables. For example, for
> {{functional_hbase.alltypessmall}}:
> {{count\(*)}} tells us that there are 100 rows:
> {noformat}
> select count(*) from functional_hbase.alltypessmall
> +----------+
> | count(*) |
> +----------+
> | 100 |
> +----------+
> {noformat}
> Table stats claim that there are only 60 rows:
> {noformat}
> show table stats functional_hbase.alltypessmall;
> +-----------------+--------------+------------+------+
> | Region Location | Start RowKey | Est. #Rows | Size |
> +-----------------+--------------+------------+------+
> | localhost | | 10 | 0B |
> | localhost | 1 | 10 | 0B |
> | localhost | 3 | 10 | 0B |
> | localhost | 5 | 10 | 0B |
> | localhost | 7 | 10 | 0B |
> | localhost | 9 | 10 | 0B |
> | Total | | 60 | 0B |
> +-----------------+--------------+------------+------+
> {noformat}
> The NDV stats show that there must be at least 100 rows:
> {noformat}
> show column stats functional_hbase.alltypessmall
> +-----------------+-----------+------------------+--------+----------+----------+
> | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
> +-----------------+-----------+------------------+--------+----------+----------+
> | id | INT | 99 | 0 | 4 | 4 |
> ...
> | timestamp_col | TIMESTAMP | 100 | 0 | 16 | 16 |
> ...
> +-----------------+-----------+------------------+--------+----------+----------+
> {noformat}
> The planner, whose estimate matters most, assumes there are only 50 rows:
> {noformat}
> select *
> from functional.alltypesagg join functional_hbase.alltypessmall using (id, int_col)
> |--01:SCAN HBASE [functional_hbase.alltypessmall]
> | row-size=89B cardinality=50
> {noformat}
> We need a more reliable estimate.