[jira] [Comment Edited] (IMPALA-7876) COMPUTE STATS TABLESAMPLE is not updating number of estimated rows

Abhishek Rawat (Jira) Thu, 27 Aug 2020 07:57:13 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185895#comment-17185895
 ]


Abhishek Rawat edited comment on IMPALA-7876 at 8/27/20, 2:56 PM:
------------------------------------------------------------------

The COMPUTE STATS <table> TABLESAMPLE SYSTEM(<X>) is generating the proper 
internal/child sqls for getting table and column stats. 
{code:java}
Query: describe t1
+------+------+---------+
| name | type | comment |
+------+------+---------+
| c1   | int  |         |
| c2   | int  |         |
+------+------+---------+


{code}
 Original SQL:
{code:java}
compute stats t1 tablesample system(10) {code}
Child SQLs:
{code:java}
SELECT ROUND(COUNT(*) / 0.8935390115) FROM t1 TABLESAMPLE SYSTEM(10) 
REPEATABLE(1598511315168)
SELECT SAMPLED_NDV(c1, 0.8935390115) AS c1, COUNT(CASE WHEN c1 IS NULL THEN 1 
ELSE NULL END), 4, CAST(4 as DOUBLE), SAMPLED_NDV(c2, 0.8935390115) AS c2, 
COUNT(CASE WHEN c2 IS NULL THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE) FROM t1 
TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168)


{code}
 

The issue seems to be that the resulting ALTER TABLE ddl (for updating the 
stats) has num_rows and num_nulls and num_distinct_values 0.
{code:java}
ExecDdl(): request=TDdlExecRequest {
  01: protocol_version (i32) = 0,
  03: ddl_type (i32) = 0,
  04: alter_table_params (struct) = TAlterTableParams {
    01: alter_type (i32) = 12,
    02: table_name (struct) = TTableName {
      01: db_name (string) = "default",
      02: table_name (string) = "t1",
    },
    12: update_stats_params (struct) = TAlterTableUpdateStatsParams {
      01: table_name (struct) = TTableName {
        01: db_name (string) = "default",
        02: table_name (string) = "t1",
      },
      02: table_stats (struct) = TTableStats {
        01: num_rows (i64) = 0,
        02: total_file_bytes (i64) = 1308622848,
      },
      04: column_stats (map) = map<string,struct>[2] {
        "c1" -> TColumnStats {
          01: avg_size (double) = 4,
          02: max_size (i64) = 0,
          03: num_distinct_values (i64) = 10,
          04: num_nulls (i64) = 0,
        },
        "c2" -> TColumnStats {
          01: avg_size (double) = 4,
          02: max_size (i64) = 0,
          03: num_distinct_values (i64) = 10,
          04: num_nulls (i64) = 0,
        },
      },
      05: expect_all_partitions (bool) = false,
      06: is_incremental (bool) = false,
    },
  },
  22: sync_ddl (bool) = false,
}


{code}
 
{code:java}
ExecDdl(): response=TDdlExecResponse {
  01: result (struct) = TCatalogUpdateResult {
    01: catalog_service_id (struct) = TUniqueId {
      01: hi (i64) = 8673057423567767407,
      02: lo (i64) = -7067907290852819321,
    },
    02: version (i64) = 5,
    03: status (struct) = TStatus {
      01: status_code (i32) = 0,
      02: error_msgs (list) = list<string>[0] {
      },
    },
    04: is_invalidate (bool) = false,
    05: updated_catalog_objects (list) = list<struct>[1] {
      [0] = TCatalogObject {
        01: type (i32) = 3,
        02: catalog_version (i64) = 5,
        05: table (struct) = TTable {
          01: db_name (string) = "default",
          02: tbl_name (string) = "t1",
          04: access_level (i32) = 1,
          05: columns (list) = list<struct>[2] {
            [0] = TColumn {
              01: columnName (string) = "c1",
              02: columnType (struct) = TColumnType {
                01: types (list) = list<struct>[1] {
                  [0] = TTypeNode {
                    01: type (i32) = 0,
                    02: scalar_type (struct) = TScalarType {
                      01: type (i32) = 5,
                    },
                  },
                },
              },
              04: col_stats (struct) = TColumnStats {
                01: avg_size (double) = 4,
                02: max_size (i64) = 4,
                03: num_distinct_values (i64) = 0,
                04: num_nulls (i64) = 0,
              },
              05: position (i32) = 0,
            },
            [1] = TColumn {
              01: columnName (string) = "c2",
              02: columnType (struct) = TColumnType {
                01: types (list) = list<struct>[1] {
                  [0] = TTypeNode {
                    01: type (i32) = 0,
                    02: scalar_type (struct) = TScalarType {
                      01: type (i32) = 5,
                    },
                  },
                },
              },
              04: col_stats (struct) = TColumnStats {
                01: avg_size (double) = 4,
                02: max_size (i64) = 4,
                03: num_distinct_values (i64) = 0,
                04: num_nulls (i64) = 0,
              },
              05: position (i32) = 1,
            },
          },
          06: clustering_columns (list) = list<struct>[0] {
          },
          07: table_stats (struct) = TTableStats {
            01: num_rows (i64) = 0,
            02: total_file_bytes (i64) = 1308622848,
          },
          08: table_type (i32) = 0,
...
{code}
 

  

When I run the internal sqls they seem to be returning proper results.
{code:java}
Query: SELECT ROUND(COUNT(*) / 0.8935390115) FROM t1 TABLESAMPLE SYSTEM(10) 
REPEATABLE(1598511315168)
Query submitted at: 2020-08-27 07:35:32 (Coordinator: 
http://escalation.gce.cloudera.com:25000)
Query progress can be monitored at: 
http://escalation.gce.cloudera.com:25000/query_plan?query_id=c443709b0e0dcdd4:eb49548a00000000
+--------------------------------+
| round(count(*) / 0.8935390115) |
+--------------------------------+
| 88892281                       |
+--------------------------------+
Fetched 1 row(s) in 3.04s


{code}
 
{code:java}
Query: SELECT SAMPLED_NDV(c1, 0.8935390115) AS c1, COUNT(CASE WHEN c1 IS NULL 
THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE), SAMPLED_NDV(c2, 0.8935390115) AS 
c2, COUNT(CASE WHEN c2 IS NULL THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE) FROM 
t1 TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168)
Query submitted at: 2020-08-27 07:38:24 (Coordinator: 
http://escalation.gce.cloudera.com:25000)
Query progress can be monitored at: 
http://escalation.gce.cloudera.com:25000/query_plan?query_id=e94f564afb2ed3d3:2584eeb200000000
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+
| c1 | count(case when c1 is null then 1 else null end) | 4 | cast(4 as double) 
| c2 | count(case when c2 is null then 1 else null end) | 4 | cast(4 as double) 
|
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+
| 10 | 0                                                | 4 | 4                 
| 10 | 0                                                | 4 | 4                 
|
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+


{code}


was (Author: arawat):
The COMPUTE STATS <table> TABLESAMPLE SYSTEM(<X>) is generating the proper 
internal/child sqls for getting table and column stats. 
{code:java}
Query: describe t1
+------+------+---------+
| name | type | comment |
+------+------+---------+
| c1   | int  |         |
| c2   | int  |         |
+------+------+---------+


{code}
 Original SQL: 
{code:java}
compute stats t1 tablesample system(10) {code}
Child SQLs:
{code:java}
SELECT ROUND(COUNT(*) / 0.8935390115) FROM t1 TABLESAMPLE SYSTEM(10) 
REPEATABLE(1598511315168)
SELECT SAMPLED_NDV(c1, 0.8935390115) AS c1, COUNT(CASE WHEN c1 IS NULL THEN 1 
ELSE NULL END), 4, CAST(4 as DOUBLE), SAMPLED_NDV(c2, 0.8935390115) AS c2, 
COUNT(CASE WHEN c2 IS NULL THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE) FROM t1 
TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168)


{code}
 

The issue seems to be that the resulting ALTER TABLE ddl (for updating the 
stats) has num_rows and num_nulls and num_distinct_values 0.
{code:java}
ExecDdl(): request=TDdlExecRequest {
  01: protocol_version (i32) = 0,
  03: ddl_type (i32) = 0,
  04: alter_table_params (struct) = TAlterTableParams {
    01: alter_type (i32) = 12,
    02: table_name (struct) = TTableName {
      01: db_name (string) = "default",
      02: table_name (string) = "t1",
    },
    12: update_stats_params (struct) = TAlterTableUpdateStatsParams {
      01: table_name (struct) = TTableName {
        01: db_name (string) = "default",
        02: table_name (string) = "t1",
      },
      02: table_stats (struct) = TTableStats {
        01: num_rows (i64) = 0,
        02: total_file_bytes (i64) = 1308622848,
      },
      04: column_stats (map) = map<string,struct>[2] {
        "c1" -> TColumnStats {
          01: avg_size (double) = 4,
          02: max_size (i64) = 0,
          03: num_distinct_values (i64) = 10,
          04: num_nulls (i64) = 0,
        },
        "c2" -> TColumnStats {
          01: avg_size (double) = 4,
          02: max_size (i64) = 0,
          03: num_distinct_values (i64) = 10,
          04: num_nulls (i64) = 0,
        },
      },
      05: expect_all_partitions (bool) = false,
      06: is_incremental (bool) = false,
    },
  },
  22: sync_ddl (bool) = false,
}


{code}
  

When I run the internal sqls they seem to be returning proper results.
{code:java}
Query: SELECT ROUND(COUNT(*) / 0.8935390115) FROM t1 TABLESAMPLE SYSTEM(10) 
REPEATABLE(1598511315168)
Query submitted at: 2020-08-27 07:35:32 (Coordinator: 
http://escalation.gce.cloudera.com:25000)
Query progress can be monitored at: 
http://escalation.gce.cloudera.com:25000/query_plan?query_id=c443709b0e0dcdd4:eb49548a00000000
+--------------------------------+
| round(count(*) / 0.8935390115) |
+--------------------------------+
| 88892281                       |
+--------------------------------+
Fetched 1 row(s) in 3.04s


{code}
 
{code:java}
Query: SELECT SAMPLED_NDV(c1, 0.8935390115) AS c1, COUNT(CASE WHEN c1 IS NULL 
THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE), SAMPLED_NDV(c2, 0.8935390115) AS 
c2, COUNT(CASE WHEN c2 IS NULL THEN 1 ELSE NULL END), 4, CAST(4 as DOUBLE) FROM 
t1 TABLESAMPLE SYSTEM(10) REPEATABLE(1598511315168)
Query submitted at: 2020-08-27 07:38:24 (Coordinator: 
http://escalation.gce.cloudera.com:25000)
Query progress can be monitored at: 
http://escalation.gce.cloudera.com:25000/query_plan?query_id=e94f564afb2ed3d3:2584eeb200000000
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+
| c1 | count(case when c1 is null then 1 else null end) | 4 | cast(4 as double) 
| c2 | count(case when c2 is null then 1 else null end) | 4 | cast(4 as double) 
|
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+
| 10 | 0                                                | 4 | 4                 
| 10 | 0                                                | 4 | 4                 
|
+----+--------------------------------------------------+---+-------------------+----+--------------------------------------------------+---+-------------------+


{code}

> COMPUTE STATS TABLESAMPLE is not updating number of estimated rows
> ------------------------------------------------------------------
>
>                 Key: IMPALA-7876
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7876
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Andre Araujo
>            Assignee: Tim Armstrong
>            Priority: Critical
>
> Running the command below seems to have no impact on the #rows stats.
> {code}
> [host:21000] default> COMPUTE STATS wide TABLESAMPLE SYSTEM(5);
> Query: COMPUTE STATS wide TABLESAMPLE SYSTEM(100)
> +-------------------------------------------+
> | summary                                   |
> +-------------------------------------------+
> | Updated 1 partition(s) and 103 column(s). |
> +-------------------------------------------+
> WARNINGS: Ignoring TABLESAMPLE because the effective sampling rate is 100%.
> The minimum sample size is COMPUTE_STATS_MIN_SAMPLE_SIZE=1.00GB and the table 
> size 20.35GB
> Fetched 1 row(s) in 43.67s
> [host:21000] default> show table stats wide;
> Query: show table stats wide
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> | #Rows | Extrap #Rows | #Files | Size    | Bytes Cached | Cache Replication 
> | Format  | Incremental stats | Location                            |
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> | 0     | -1           | 84     | 20.35GB | NOT CACHED   | NOT CACHED        
> | PARQUET | false             | hdfs://ns1/user/hive/warehouse/wide |
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> Fetched 1 row(s) in 0.01s
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (IMPALA-7876) COMPUTE STATS TABLESAMPLE is not updating number of estimated rows

Reply via email to