[ 
https://issues.apache.org/jira/browse/IMPALA-7876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185905#comment-17185905
 ] 

Abhishek Rawat edited comment on IMPALA-7876 at 8/27/20, 3:06 PM:
------------------------------------------------------------------

It's also strange that the ddlExecRequest has proper column stats but the 
ddlExecResponse seems to overwrite them with zeros. In this case there are 
10 distinct values in the table for both columns. _num_distinct_values_ is 
10 for both c1 and c2 in the request, but in the response they are zero! 
The request excerpt comes first, followed by the response excerpt.

 
{code:java}
      04: column_stats (map) = map<string,struct>[2] {
        "c1" -> TColumnStats {
          01: avg_size (double) = 4,
          02: max_size (i64) = 0,
          03: num_distinct_values (i64) = 10,
          04: num_nulls (i64) = 0,
        },
        "c2" -> TColumnStats {
          01: avg_size (double) = 4,
          02: max_size (i64) = 0,
          03: num_distinct_values (i64) = 10,
          04: num_nulls (i64) = 0,
        },
{code}
 

 
{code:java}
          [0] = TColumn {
              01: columnName (string) = "c1",
              02: columnType (struct) = TColumnType {
                01: types (list) = list<struct>[1] {
                  [0] = TTypeNode {
                    01: type (i32) = 0,
                    02: scalar_type (struct) = TScalarType {
                      01: type (i32) = 5,
                    },
                  },
                },
              },
              04: col_stats (struct) = TColumnStats {
                01: avg_size (double) = 4,
                02: max_size (i64) = 4,
                03: num_distinct_values (i64) = 0,
                04: num_nulls (i64) = 0,
              },
              05: position (i32) = 0,
            },
            [1] = TColumn {
              01: columnName (string) = "c2",
              02: columnType (struct) = TColumnType {
                01: types (list) = list<struct>[1] {
                  [0] = TTypeNode {
                    01: type (i32) = 0,
                    02: scalar_type (struct) = TScalarType {
                      01: type (i32) = 5,
                    },
                  },
                },
              },
              04: col_stats (struct) = TColumnStats {
                01: avg_size (double) = 4,
                02: max_size (i64) = 4,
                03: num_distinct_values (i64) = 0,
                04: num_nulls (i64) = 0,
              },
              05: position (i32) = 1,
{code}
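
To make the comparison above repeatable on larger dumps, here is a minimal Python sketch that pulls _num_distinct_values_ per column out of both excerpts and lists the columns whose value went from non-zero in the request to zero in the response. The function names (`ndv_by_column`, `zeroed_columns`) and the line-oriented parsing are assumptions made for illustration; this is not Impala code, and the embedded strings are trimmed copies of the dumps shown above.

```python
import re

# Trimmed copies of the two excerpts above (request first, then response).
REQUEST = '''
"c1" -> TColumnStats {
  03: num_distinct_values (i64) = 10,
},
"c2" -> TColumnStats {
  03: num_distinct_values (i64) = 10,
},
'''

RESPONSE = '''
[0] = TColumn {
  01: columnName (string) = "c1",
  04: col_stats (struct) = TColumnStats {
    03: num_distinct_values (i64) = 0,
  },
},
[1] = TColumn {
  01: columnName (string) = "c2",
  04: col_stats (struct) = TColumnStats {
    03: num_distinct_values (i64) = 0,
  },
},
'''

# Hypothetical helper: matches the two ways a column name appears in the dumps.
_NAME_PATTERNS = (
    re.compile(r'"([^"]+)"\s*->\s*TColumnStats'),        # map entry in the request
    re.compile(r'columnName \(string\) = "([^"]+)"'),    # TColumn in the response
)
_NDV_PATTERN = re.compile(r'num_distinct_values \(i64\) = (-?\d+)')


def ndv_by_column(dump):
    """Return {column name: num_distinct_values} for one debug dump."""
    result = {}
    current = None  # most recently seen column name
    for line in dump.splitlines():
        for pattern in _NAME_PATTERNS:
            match = pattern.search(line)
            if match:
                current = match.group(1)
                break
        else:
            match = _NDV_PATTERN.search(line)
            if match and current is not None:
                result[current] = int(match.group(1))
    return result


def zeroed_columns(request_dump, response_dump):
    """Columns with a non-zero NDV in the request and a zero NDV in the response."""
    request = ndv_by_column(request_dump)
    response = ndv_by_column(response_dump)
    return sorted(
        name for name, ndv in request.items()
        if ndv > 0 and response.get(name) == 0
    )


if __name__ == "__main__":
    print(zeroed_columns(REQUEST, RESPONSE))
```

For the excerpts in this comment, both c1 and c2 are reported, which matches the observation that the response carries zeros where the request had 10.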
 

 



> COMPUTE STATS TABLESAMPLE is not updating number of estimated rows
> ------------------------------------------------------------------
>
>                 Key: IMPALA-7876
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7876
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Andre Araujo
>            Assignee: Tim Armstrong
>            Priority: Critical
>
> Running the command below seems to have no impact on the #rows stats.
> {code}
> [host:21000] default> COMPUTE STATS wide TABLESAMPLE SYSTEM(5);
> Query: COMPUTE STATS wide TABLESAMPLE SYSTEM(100)
> +-------------------------------------------+
> | summary                                   |
> +-------------------------------------------+
> | Updated 1 partition(s) and 103 column(s). |
> +-------------------------------------------+
> WARNINGS: Ignoring TABLESAMPLE because the effective sampling rate is 100%.
> The minimum sample size is COMPUTE_STATS_MIN_SAMPLE_SIZE=1.00GB and the table size 20.35GB
> Fetched 1 row(s) in 43.67s
> [host:21000] default> show table stats wide;
> Query: show table stats wide
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> | #Rows | Extrap #Rows | #Files | Size    | Bytes Cached | Cache Replication | Format  | Incremental stats | Location                            |
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> | 0     | -1           | 84     | 20.35GB | NOT CACHED   | NOT CACHED        | PARQUET | false             | hdfs://ns1/user/hive/warehouse/wide |
> +-------+--------------+--------+---------+--------------+-------------------+---------+-------------------+-------------------------------------+
> Fetched 1 row(s) in 0.01s
> {code}


