Re: [PR] HIVE-27725: Remove redundant columns in TAB_COL_STATS and PART_COL_STATS tables [hive]

via GitHub Mon, 22 Jan 2024 11:16:55 -0800


deniskuzZ commented on PR #4744:
URL: https://github.com/apache/hive/pull/4744#issuecomment-1904647165


   > > From what I see in benchmarkGetPartitionsStat, looks there is only one 
table, thousands of partitions and col stats, am I missing something?
   > 
   > 2 millions partition column stats are added in the mysql before benchmark 
test. benchmarkGetPartitionsStat will still only query the count as input by -N 
options.
   > 
   > > I guess the performance regression is caused by multiple join after 
removing the columns, how many databases, tables, partitions in the bench test?
   > 
   > The 2 millions partition column stats are: 10 dbs * 10 tbls per db * 200 
partitions per tbl * 100 column stats per partition
   > 
   > BTW, all joins across DBS, TBLS, and PARTITIONS tables would hit indexes, 
I don't think such join would cause performance regression:
   > 
   > ```sql
   > mysql> explain select * from PART_COL_STATS join PARTITIONS on 
PARTITIONS.PART_ID=PART_COL_STATS.PART_ID join TBLS on 
TBLS.TBL_ID=PARTITIONS.TBL_ID join DBS on DBS.DB_ID=TBLS.DB_ID where 
DBS.NAME='db0' and DBS.CTLG_NAME='hive' and
   >  TBLS.TBL_NAME='tbl0' and PARTITIONS.PART_NAME='p=part_val0' and 
PART_COL_STATS.engine='hive' and PART_COL_STATS.COLUMN_NAME='id0';
   > 
+----+-------------+----------------+------------+-------+----------------------------------+-----------------+---------+-------------+------+----------+-------------+
   > | id | select_type | table          | partitions | type  | possible_keys   
                 | key             | key_len | ref         | rows | filtered | 
Extra       |
   > 
+----+-------------+----------------+------------+-------+----------------------------------+-----------------+---------+-------------+------+----------+-------------+
   > |  1 | SIMPLE      | DBS            | NULL       | const | 
PRIMARY,UNIQUE_DATABASE,CTLG_FK1 | UNIQUE_DATABASE | 389     | const,const |    
1 |   100.00 | NULL        |
   > |  1 | SIMPLE      | TBLS           | NULL       | const | 
PRIMARY,UNIQUETABLE,TBLS_N49     | UNIQUETABLE     | 268     | const,const |    
1 |   100.00 | NULL        |
   > |  1 | SIMPLE      | PARTITIONS     | NULL       | const | 
PRIMARY,UNIQUEPARTITION          | UNIQUEPARTITION | 779     | const,const |    
1 |   100.00 | NULL        |
   > |  1 | SIMPLE      | PART_COL_STATS | NULL       | ref   | PCS_STATS_IDX   
                 | PCS_STATS_IDX   | 777     | const,const |    1 |    10.00 | 
Using where |
   > 
+----+-------------+----------------+------------+-------+----------------------------------+-----------------+---------+-------------+------+----------+-------------+
   > 4 rows in set, 1 warning (0.00 sec)
   > ```
   
   in production, I've seen an average 100 tbl x 1000 part x 100 cols 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-27725: Remove redundant columns in TAB_COL_STATS and PART_COL_STATS tables [hive]

Reply via email to