[
https://issues.apache.org/jira/browse/IMPALA-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185895#comment-17185895
]
Abhishek Rawat edited comment on IMPALA-7876 at 8/27/20, 2:56 PM:
------------------------------------------------------------------
The COMPUTE STATS <table> TABLESAMPLE SYSTEM(<X>) is generating the proper
internal/child sqls for getting table and column stats.
{code:java}
Query: describe t1
+------+------+---------+
| name | type | comment |
+------+------+---------+
| c1 | int | |
| c2 | int | |
+------+------+---------+
{code}
Original SQL:
{code:java}
compute stats t1 tablesample system(10) {code}
Child SQLs:
{code:java}
SELECT ROUND(COUNT(*) / 0.8935390115) FROM t1 TABLESAMPLE SYSTEM(10)
REPEATABLE(1598511315168)
SELECT SAMPLED_NDV(c1, 0.8935390115) AS c1, COUNT(CASE WHEN c1 IS NULL THEN 1
ELSE NULL END), 4, CAST(4 as DOUBLE), SAMPLED_NDV(c2, 0.8935390115) AS c2,
COUNT(CASE WHEN c2 IS NULL THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE) FROM t1
TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168)
{code}
The issue seems to be that the resulting ALTER TABLE ddl (for updating the
stats) has num_rows and num_nulls and num_distinct_values 0.
{code:java}
ExecDdl(): request=TDdlExecRequest {
01: protocol_version (i32) = 0,
03: ddl_type (i32) = 0,
04: alter_table_params (struct) = TAlterTableParams {
01: alter_type (i32) = 12,
02: table_name (struct) = TTableName {
01: db_name (string) = "default",
02: table_name (string) = "t1",
},
12: update_stats_params (struct) = TAlterTableUpdateStatsParams {
01: table_name (struct) = TTableName {
01: db_name (string) = "default",
02: table_name (string) = "t1",
},
02: table_stats (struct) = TTableStats {
01: num_rows (i64) = 0,
02: total_file_bytes (i64) = 1308622848,
},
04: column_stats (map) = map<string,struct>[2] {
"c1" -> TColumnStats {
01: avg_size (double) = 4,
02: max_size (i64) = 0,
03: num_distinct_values (i64) = 10,
04: num_nulls (i64) = 0,
},
"c2" -> TColumnStats {
01: avg_size (double) = 4,
02: max_size (i64) = 0,
03: num_distinct_values (i64) = 10,
04: num_nulls (i64) = 0,
},
},
05: expect_all_partitions (bool) = false,
06: is_incremental (bool) = false,
},
},
22: sync_ddl (bool) = false,
}
{code}
{code:java}
ExecDdl(): response=TDdlExecResponse {
01: result (struct) = TCatalogUpdateResult {
01: catalog_service_id (struct) = TUniqueId {
01: hi (i64) = 8673057423567767407,
02: lo (i64) = -7067907290852819321,
},
02: version (i64) = 5,
03: status (struct) = TStatus {
01: status_code (i32) = 0,
02: error_msgs (list) = list<string>[0] {
},
},
04: is_invalidate (bool) = false,
05: updated_catalog_objects (list) = list<struct>[1] {
[0] = TCatalogObject {
01: type (i32) = 3,
02: catalog_version (i64) = 5,
05: table (struct) = TTable {
01: db_name (string) = "default",
02: tbl_name (string) = "t1",
04: access_level (i32) = 1,
05: columns (list) = list<struct>[2] {
[0] = TColumn {
01: columnName (string) = "c1",
02: columnType (struct) = TColumnType {
01: types (list) = list<struct>[1] {
[0] = TTypeNode {
01: type (i32) = 0,
02: scalar_type (struct) = TScalarType {
01: type (i32) = 5,
},
},
},
},
04: col_stats (struct) = TColumnStats {
01: avg_size (double) = 4,
02: max_size (i64) = 4,
03: num_distinct_values (i64) = 0,
04: num_nulls (i64) = 0,
},
05: position (i32) = 0,
},
[1] = TColumn {
01: columnName (string) = "c2",
02: columnType (struct) = TColumnType {
01: types (list) = list<struct>[1] {
[0] = TTypeNode {
01: type (i32) = 0,
02: scalar_type (struct) = TScalarType {
01: type (i32) = 5,
},
},
},
},
04: col_stats (struct) = TColumnStats {
01: avg_size (double) = 4,
02: max_size (i64) = 4,
03: num_distinct_values (i64) = 0,
04: num_nulls (i64) = 0,
},
05: position (i32) = 1,
},
},
06: clustering_columns (list) = list<struct>[0] {
},
07: table_stats (struct) = TTableStats {
01: num_rows (i64) = 0,
02: total_file_bytes (i64) = 1308622848,
},
08: table_type (i32) = 0,
...
{code}
When I run the internal sqls they seem to be returning proper results.
{code:java}
Query: SELECT ROUND(COUNT(*) / 0.8935390115) FROM t1 TABLESAMPLE SYSTEM(10)
REPEATABLE(1598511315168)
Query submitted at: 2020-08-27 07:35:32 (Coordinator:
http://escalation.gce.cloudera.com:25000)
Query progress can be monitored at:
http://escalation.gce.cloudera.com:25000/query_plan?query_id=c443709b0e0dcdd4:eb49548a00000000
+--------------------------------+
| round(count(*) / 0.8935390115) |
+--------------------------------+
| 88892281 |
+--------------------------------+
Fetched 1 row(s) in 3.04s
{code}
{code:java}
Query: SELECT SAMPLED_NDV(c1, 0.8935390115) AS c1, COUNT(CASE WHEN c1 IS NULL
THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE), SAMPLED_NDV(c2, 0.8935390115) AS
c2, COUNT(CASE WHEN c2 IS NULL THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE) FROM
t1 TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168)
Query submitted at: 2020-08-27 07:38:24 (Coordinator:
http://escalation.gce.cloudera.com:25000)
Query progress can be monitored at:
http://escalation.gce.cloudera.com:25000/query_plan?query_id=e94f564afb2ed3d3:2584eeb200000000
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+
| c1 | count(case when c1 is null then 1 else null end) | 4 | cast(4 as double)
| c2 | count(case when c2 is null then 1 else null end) | 4 | cast(4 as double)
|
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+
| 10 | 0 | 4 | 4
| 10 | 0 | 4 | 4
|
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+
{code}
was (Author: arawat):
The COMPUTE STATS <table> TABLESAMPLE SYSTEM(<X>) is generating the proper
internal/child sqls for getting table and column stats.
{code:java}
Query: describe t1
+------+------+---------+
| name | type | comment |
+------+------+---------+
| c1 | int | |
| c2 | int | |
+------+------+---------+
{code}
Original SQL:
{code:java}
compute stats t1 tablesample system(10) {code}
Child SQLs:
{code:java}
SELECT ROUND(COUNT(*) / 0.8935390115) FROM t1 TABLESAMPLE SYSTEM(10)
REPEATABLE(1598511315168)
SELECT SAMPLED_NDV(c1, 0.8935390115) AS c1, COUNT(CASE WHEN c1 IS NULL THEN 1
ELSE NULL END), 4, CAST(4 as DOUBLE), SAMPLED_NDV(c2, 0.8935390115) AS c2,
COUNT(CASE WHEN c2 IS NULL THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE) FROM t1
TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168)
{code}
The issue seems to be that the resulting ALTER TABLE ddl (for updating the
stats) has num_rows and num_nulls and num_distinct_values 0.
{code:java}
ExecDdl(): request=TDdlExecRequest {
01: protocol_version (i32) = 0,
03: ddl_type (i32) = 0,
04: alter_table_params (struct) = TAlterTableParams {
01: alter_type (i32) = 12,
02: table_name (struct) = TTableName {
01: db_name (string) = "default",
02: table_name (string) = "t1",
},
12: update_stats_params (struct) = TAlterTableUpdateStatsParams {
01: table_name (struct) = TTableName {
01: db_name (string) = "default",
02: table_name (string) = "t1",
},
02: table_stats (struct) = TTableStats {
01: num_rows (i64) = 0,
02: total_file_bytes (i64) = 1308622848,
},
04: column_stats (map) = map<string,struct>[2] {
"c1" -> TColumnStats {
01: avg_size (double) = 4,
02: max_size (i64) = 0,
03: num_distinct_values (i64) = 10,
04: num_nulls (i64) = 0,
},
"c2" -> TColumnStats {
01: avg_size (double) = 4,
02: max_size (i64) = 0,
03: num_distinct_values (i64) = 10,
04: num_nulls (i64) = 0,
},
},
05: expect_all_partitions (bool) = false,
06: is_incremental (bool) = false,
},
},
22: sync_ddl (bool) = false,
}
{code}
When I run the internal sqls they seem to be returning proper results.
{code:java}
Query: SELECT ROUND(COUNT(*) / 0.8935390115) FROM t1 TABLESAMPLE SYSTEM(10)
REPEATABLE(1598511315168)
Query submitted at: 2020-08-27 07:35:32 (Coordinator:
http://escalation.gce.cloudera.com:25000)
Query progress can be monitored at:
http://escalation.gce.cloudera.com:25000/query_plan?query_id=c443709b0e0dcdd4:eb49548a00000000
+--------------------------------+
| round(count(*) / 0.8935390115) |
+--------------------------------+
| 88892281 |
+--------------------------------+
Fetched 1 row(s) in 3.04s
{code}
{code:java}
Query: SELECT SAMPLED_NDV(c1, 0.8935390115) AS c1, COUNT(CASE WHEN c1 IS NULL
THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE), SAMPLED_NDV(c2, 0.8935390115) AS
c2, COUNT(CASE WHEN c2 IS NULL THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE) FROM
t1 TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168)
Query submitted at: 2020-08-27 07:38:24 (Coordinator:
http://escalation.gce.cloudera.com:25000)
Query progress can be monitored at:
http://escalation.gce.cloudera.com:25000/query_plan?query_id=e94f564afb2ed3d3:2584eeb200000000
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+
| c1 | count(case when c1 is null then 1 else null end) | 4 | cast(4 as double)
| c2 | count(case when c2 is null then 1 else null end) | 4 | cast(4 as double)
|
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+
| 10 | 0 | 4 | 4
| 10 | 0 | 4 | 4
|
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+
{code}
> COMPUTE STATS TABLESAMPLE is not updating number of estimated rows
> ------------------------------------------------------------------
>
> Key: IMPALA-7876
> URL: https://issues.apache.org/jira/browse/IMPALA-7876
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 3.0
> Reporter: Andre Araujo
> Assignee: Tim Armstrong
> Priority: Critical
>
> Running the command below seems to have no impact on the #rows stats.
> {code}
> [host:21000] default> COMPUTE STATS wide TABLESAMPLE SYSTEM(5);
> Query: COMPUTE STATS wide TABLESAMPLE SYSTEM(100)
> +-------------------------------------------+
> | summary |
> +-------------------------------------------+
> | Updated 1 partition(s) and 103 column(s). |
> +-------------------------------------------+
> WARNINGS: Ignoring TABLESAMPLE because the effective sampling rate is 100%.
> The minimum sample size is COMPUTE_STATS_MIN_SAMPLE_SIZE=1.00GB and the table
> size 20.35GB
> Fetched 1 row(s) in 43.67s
> [host:21000] default> show table stats wide;
> Query: show table stats wide
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> | #Rows | Extrap #Rows | #Files | Size | Bytes Cached | Cache Replication
> | Format | Incremental stats | Location |
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> | 0 | -1 | 84 | 20.35GB | NOT CACHED | NOT CACHED
> | PARQUET | false | hdfs://ns1/user/hive/warehouse/wide |
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> Fetched 1 row(s) in 0.01s
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]