Naresh P R created HIVE-28746:
---------------------------------
Summary: Provide an optional config to autogather column stats
only for columns mentioned in CREATE TABLE STATEMENT
Key: HIVE-28746
URL: https://issues.apache.org/jira/browse/HIVE-28746
Project: Hive
Issue Type: New Feature
Reporter: Naresh P R
By default, Hive automatically gathers column stats
(hive.stats.column.autogather=true) in all ETL jobs. This inflates the
PART_COL_STATS table in the metastore backend db; on my cluster it holds
350 GB of data.
As part of the CREATE TABLE statement, we could provide an OPTIONAL setting to
enable/disable automatic column stats gathering for specific columns, rather
than collecting stats for every column of the table.
Syntax can be as follows:
{code:sql}
CREATE TABLE [TABLE_NAME] (
  COL1 [DATATYPE] [COMMENT 'col_comment'] [NO_STATS|NEED_STATS],
  ...
);
ALTER TABLE [TABLE_NAME] SET AUTOGATHER STATISTICS FOR COLUMNS
  [COMMA_SEPARATED_COL_NAMES] [NO_STATS|NEED_STATS];
{code}
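For illustration only, a concrete instance of the proposed syntax might read as follows. The table web_events and its columns are hypothetical, and NO_STATS, NEED_STATS, and SET AUTOGATHER STATISTICS are the keywords proposed in this ticket, not syntax that Hive accepts today.
{code:sql}
-- Hypothetical: uses the NO_STATS / NEED_STATS keywords proposed in this ticket.
CREATE TABLE web_events (
  user_id    BIGINT COMMENT 'join key'                     NEED_STATS,
  event_type STRING COMMENT 'used in filters and group by' NEED_STATS,
  payload    STRING COMMENT 'raw JSON, never filtered on'  NO_STATS
)
PARTITIONED BY (event_date STRING);

-- Hypothetical: later switch stats gathering off for one more column.
ALTER TABLE web_events SET AUTOGATHER STATISTICS FOR COLUMNS
  event_type NO_STATS;
{code}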
In an ETL flow, users could then disable column stats collection for the whole
table by default and enable it only for the columns that need it.
Users can identify the columns that appear in join conditions, GROUP BY
clauses, dynamic partition pruning (DPP), filter conditions, etc., and enable
stats only for those columns. ETL jobs would then collect stats for just the
few required columns, even on wide tables with many partitions.
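For comparison, a similar effect can be approximated today with existing Hive features, at the cost of an extra pass over the data: turn off autogather for the ETL session, then compute column stats explicitly for the chosen columns. This is only a sketch; the table, partition, and column names are hypothetical.
{code:sql}
-- Turn off automatic column stats gathering for this session's writes.
SET hive.stats.column.autogather=false;

-- Hypothetical ETL load into a single partition.
INSERT OVERWRITE TABLE web_events PARTITION (event_date='2025-02-01')
SELECT user_id, event_type, payload
FROM staging_web_events;

-- Gather column stats only for the join/filter columns of that partition.
ANALYZE TABLE web_events PARTITION (event_date='2025-02-01')
COMPUTE STATISTICS FOR COLUMNS user_id, event_type;
{code}
The per-column option proposed here would avoid both the separate ANALYZE step and the need to repeat the column list in every ETL job.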