[
https://issues.apache.org/jira/browse/HIVE-29432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055899#comment-18055899
]
Thomas Rebele commented on HIVE-29432:
--------------------------------------
The plan for calculating the statistics is added at the end of
{{org.apache.hadoop.hive.ql.parse.SemanticAnalyzer#genFileSinkPlan}}. The
method {{canRunAutogatherStats(Operator curr)}} checks whether all column types
are supported. If any column has an unsupported type, the autogather pipeline
is not added at all, so no statistics are collected for the other columns
either.
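
To make the effect concrete, here is a minimal sketch of the all-or-nothing check next to a per-column alternative of the kind the issue asks for. It is not the Hive implementation: the class {{AutogatherSketch}}, both method names, and the contents of the {{SUPPORTED}} set are assumptions made only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only; names and the supported-type set are hypothetical.
class AutogatherSketch {
    // Assumption: the types for which column statistics can be computed.
    static final Set<String> SUPPORTED = Set.of("int", "timestamp");

    // All-or-nothing check, mirroring the behaviour described above:
    // a single unsupported column disables autogather for the whole table.
    static boolean canRunForAllColumns(Map<String, String> columnTypes) {
        return columnTypes.values().stream().allMatch(SUPPORTED::contains);
    }

    // Per-column alternative: keep only the columns whose type is supported,
    // so one unsupported column does not affect the others.
    static List<String> columnsWithSupportedTypes(Map<String, String> columnTypes) {
        return columnTypes.entrySet().stream()
                .filter(e -> SUPPORTED.contains(e.getValue()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

For a table shaped like test_stats1 (a int, b timestamp with local time zone), the first method returns false, which corresponds to the missing stats for column a, while the second would still select column a.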
> Statistics missing for tables with a TIMESTAMP WITH LOCAL TIME ZONE
> -------------------------------------------------------------------
>
> Key: HIVE-29432
> URL: https://issues.apache.org/jira/browse/HIVE-29432
> Project: Hive
> Issue Type: Bug
> Affects Versions: 4.3.0
> Reporter: Thomas Rebele
> Priority: Major
>
> Given the following qfile:
> {code:java}
> set hive.stats.kll.enable=true;
> set metastore.stats.fetch.bitvector=true;
> set metastore.stats.fetch.kll=true;
> set hive.stats.autogather=true;
> set hive.stats.column.autogather=true;
> CREATE TABLE test_stats0 (a int, b timestamp) STORED AS TEXTFILE;
> CREATE TABLE test_stats1 (a int, b timestamp with local time zone) STORED AS TEXTFILE;
> INSERT INTO test_stats0 (a, b) VALUES (1, "2020-11-02 00:00:00");
> INSERT INTO test_stats1 (a, b) VALUES (1, "2020-11-02 00:00:00");
> DESCRIBE FORMATTED test_stats0 a;
> DESCRIBE FORMATTED test_stats0 b;
> DESCRIBE FORMATTED test_stats1 a;
> DESCRIBE FORMATTED test_stats1 b;
> {code}
> The statistics for test_stats0 column a are computed successfully:
> {code:java}
> POSTHOOK: Input: default@test_stats0
> col_name a
> data_type int
> min 1
> max 1
> num_nulls 0
> distinct_count 1
> avg_col_len
> max_col_len
> num_trues
> num_falses
> bit_vector HL
> histogram Q1: 1, Q2: 1, Q3: 1
> {code}
> However, the statistics for test_stats1 column a are missing:
> {code:java}
> POSTHOOK: Input: default@test_stats1
> col_name a
> data_type int
> min
> max
> num_nulls
> distinct_count
> avg_col_len
> max_col_len
> num_trues
> num_falses
> bit_vector
> histogram
> {code}
> The same holds for column b: stats are available for table test_stats0, but
> not for test_stats1.
> Even if the stats for a TIMESTAMP WITH LOCAL TIME ZONE column cannot be
> calculated, this should not prevent stats from being gathered for the other
> columns.