[ 
https://issues.apache.org/jira/browse/IMPALA-8024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16733748#comment-16733748
 ] 

Paul Rogers commented on IMPALA-8024:
-------------------------------------

Even simpler: for the {{stringids}} table mentioned above, table stats report 
the row count as 10K. That number should be used as the raw table row count. 
The code instead appears to use other numbers: effective row counts that 
reflect the number of rows to be scanned. In fact, the code seems to confuse 
the number of rows read with the table cardinality. See IMPALA-8045.
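
To illustrate the distinction, here is a minimal Python sketch. It is not Impala's planner code: the names {{TableStats}}, {{table_cardinality}} and {{scan_cardinality}}, the selectivity value, and the use of -1 for "unknown" are all assumptions made for illustration. It keeps the row count from table stats separate from the estimated number of rows a scan returns:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableStats:
    """Hypothetical holder for the row count recorded in table stats."""
    num_rows: Optional[int]  # None when stats have not been computed


def table_cardinality(stats: TableStats) -> int:
    """Raw table cardinality: the stats row count, or -1 if unknown."""
    return stats.num_rows if stats.num_rows is not None else -1


def scan_cardinality(stats: TableStats, selectivity: float) -> int:
    """Estimated rows produced by a scan after applying predicates.

    Derived from the table cardinality, but kept as a separate number so
    it never overwrites the table cardinality itself.
    """
    if not 0.0 <= selectivity <= 1.0:
        raise ValueError("selectivity must be in [0, 1]")
    base = table_cardinality(stats)
    if base < 0:
        return -1  # unknown stays unknown
    if base == 0:
        return 0
    return max(1, round(base * selectivity))


stringids = TableStats(num_rows=10_000)  # 10K rows, as reported by table stats
print(table_cardinality(stringids))      # raw table row count
print(scan_cardinality(stringids, 0.1))  # rows expected from a selective scan
```

The point of the sketch is only that the two quantities are computed by different functions, so a change to the scan estimate cannot alter what is reported as the table's size.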

 

> HBase table cardinality estimates are wrong
> -------------------------------------------
>
>                 Key: IMPALA-8024
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8024
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> IMPALA-8021 added cardinality estimates to EXPLAIN plan output. Running some 
> of our {{PlannerTest}} files revealed that our HBase cardinality estimates 
> are very poor, even for our simple test tables. For example, for 
> {{functional_hbase.alltypessmall}}:
> {{count\(*)}} tells us that there are 100 rows:
> {noformat}
> select count(*) from functional_hbase.alltypessmall
> +----------+
> | count(*) |
> +----------+
> | 100      |
> +----------+
> {noformat}
> Table stats claim that there are only 60 rows:
> {noformat}
> show table stats functional_hbase.alltypessmall;
> +-----------------+--------------+------------+------+
> | Region Location | Start RowKey | Est. #Rows | Size |
> +-----------------+--------------+------------+------+
> | localhost       |              | 10         | 0B   |
> | localhost       | 1            | 10         | 0B   |
> | localhost       | 3            | 10         | 0B   |
> | localhost       | 5            | 10         | 0B   |
> | localhost       | 7            | 10         | 0B   |
> | localhost       | 9            | 10         | 0B   |
> | Total           |              | 60         | 0B   |
> +-----------------+--------------+------------+------+
> {noformat}
> The NDV stats show that there must be at least 100 rows:
> {noformat}
> show column stats functional_hbase.alltypessmall
> +-----------------+-----------+------------------+--------+----------+----------+
> | Column          | Type      | #Distinct Values | #Nulls | Max Size | Avg Size |
> +-----------------+-----------+------------------+--------+----------+----------+
> | id              | INT       | 99               | 0      | 4        | 4        |
> ...
> | timestamp_col   | TIMESTAMP | 100              | 0      | 16       | 16       |
> ...
> +-----------------+-----------+------------------+--------+----------+----------+
> {noformat}
> The planner, whose estimate matters most, thinks there are only 50 rows:
> {noformat}
> select *
> from functional.alltypesagg join functional_hbase.alltypessmall using (id, int_col)
> |--01:SCAN HBASE [functional_hbase.alltypessmall]
> |     row-size=89B cardinality=50
> {noformat}
> We need a more reliable estimate.
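
The inconsistency in the description (table stats say 60 rows, while one column's NDV is 100) suggests a simple sanity check. The Python sketch below is an assumption about how such a check could look, not what Impala does; {{reconcile_row_count}} is a hypothetical helper. It relies only on the fact that a column cannot have more distinct values than the table has rows, and NDV values are themselves estimates, so the result is a heuristic lower bound rather than an exact count:

```python
from typing import Mapping


def reconcile_row_count(stats_row_count: int, column_ndvs: Mapping[str, int]) -> int:
    """Return a row-count estimate that is at least the largest column NDV.

    A column cannot hold more distinct values than there are rows, so the
    largest NDV acts as a lower bound on the table's row count.
    """
    max_ndv = max(column_ndvs.values(), default=0)
    return max(stats_row_count, max_ndv)


# Numbers taken from the report above for functional_hbase.alltypessmall.
ndvs = {"id": 99, "timestamp_col": 100}
estimate = reconcile_row_count(stats_row_count=60, column_ndvs=ndvs)
print(estimate)
```

With the figures from the report, the check raises the 60-row estimate to the NDV-implied minimum, which matches what {{count(*)}} returns for this table.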



