Re: [PR] HIVE-27725: Remove redundant columns in TAB_COL_STATS and PART_COL_STATS tables [hive]

via GitHub Sat, 20 Jan 2024 17:55:38 -0800


dengzhhu653 commented on PR #4744:
URL: https://github.com/apache/hive/pull/4744#issuecomment-1902476462


   > @dengzhhu653 @zhangbutao I ran another benchmark with 2 million partition stats written in advance; it shows no obvious performance regression:
   > 
   > ```sql
   > mysql> select count(1) from PART_COL_STATS;
   > +----------+
   > | count(1) |
   > +----------+
   > |  2000000 |
   > +----------+
   > 1 row in set (0.40 sec)
   > ```
   > 
   > benchmark test:
   > 
   > ```shell
   > java -jar ./hmsbench-jar-with-dependencies.jar -H localhost \
   >   --savedata /tmp/benchdata --sanitize -N 100 -N 1000 \
   >   -o bench_results_direct.csv -C -d testbench_http --params=100 \
   >   -E 'drop.*' -E 'renameTable.*' -E 'getTableObjectsByName.*' \
   >   -E 'listTables.*' -E 'listPartitions.*' -E 'getPartitionsByNames.*' \
   >   -E 'getPartitionNames.*' -E 'listPartition' -E 'getPartition' \
   >   -E 'getPartitions' -E 'getPartitions.10' -E 'getPartitions.100' \
   >   -E 'getPartitions.1000' -E 'addPartition.*' -E 'addPartitions.*' \
   >   -E 'alterPartitions.*' -E 'getNid' -E 'listDatabases' -E 'getTable' \
   >   -E 'createTable' -E 'openTxn.*'
   > ```
   > 
   > * before this patch
   > 
   > ```shell
   > Operation                  Mean     Med      Min      Max      Err%
   > getPartitionsStat          5.21167  5.16801  4.92140  6.05965  3.77022
   > getPartitionsStat.100      6.93186  6.83728  6.48675  10.2091  6.80759
   > getPartitionsStat.1000     15.1901  14.8172  14.3164  19.6772  6.61940
   > updatePartitionsStat       9.83066  9.63766  9.27253  16.3278  9.28177
   > updatePartitionsStat.100   1009.46  1009.26  991.282  1052.16  0.956140
   > updatePartitionsStat.1000  10091.7  10088.1  9929.50  10309.3  0.760790
   > ```
   > 
   > * after this patch
   > 
   > ```shell
   > Operation                  Mean     Med      Min      Max      Err%
   > getPartitionsStat          5.56409  5.49373  5.20583  7.02619  5.03727
   > getPartitionsStat.100      6.34526  6.29966  5.97725  7.85943  4.11913
   > getPartitionsStat.1000     14.2403  14.1247  13.6040  15.8745  3.02256
   > updatePartitionsStat       10.5586  10.3743  9.88599  14.8948  7.01613
   > updatePartitionsStat.100   1013.06  1011.71  978.329  1047.57  1.45127
   > updatePartitionsStat.1000  9912.52  9905.62  9677.24  10163.9  1.22903
   > ```
   
   From what I see in `benchmarkGetPartitionsStat`, it looks like there is only one table, with thousands of partitions and column stats. Am I missing something?
   I guess the performance regression is caused by the multiple joins introduced after removing the columns. How many databases, tables, and partitions are in the bench test?
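
To make the join concern concrete, here is a rough sketch of the two query shapes, run against an in-memory SQLite database. Everything in it is an assumption for illustration: the `STATS_WIDE` and `STATS_SLIM` tables are hypothetical stand-ins for `PART_COL_STATS` before and after the patch, the `DBS`/`TBLS`/`PARTITIONS` columns are a heavily simplified guess at the metastore layout, and the queries are not the ones the metastore actually issues.

```python
import sqlite3

# Hypothetical, simplified schema; not the real metastore DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER, TBL_NAME TEXT);
CREATE TABLE PARTITIONS (PART_ID INTEGER PRIMARY KEY, TBL_ID INTEGER, PART_NAME TEXT);
-- "before": names are duplicated on every stats row
CREATE TABLE STATS_WIDE (PART_ID INTEGER, DB_NAME TEXT, TABLE_NAME TEXT,
                         PARTITION_NAME TEXT, COLUMN_NAME TEXT, NUM_NULLS INTEGER);
-- "after": names must be resolved through PART_ID
CREATE TABLE STATS_SLIM (PART_ID INTEGER, COLUMN_NAME TEXT, NUM_NULLS INTEGER);
""")

# One database, one table, three partitions, one column stat per partition.
conn.execute("INSERT INTO DBS VALUES (1, 'bench_db')")
conn.execute("INSERT INTO TBLS VALUES (1, 1, 'bench_tbl')")
for part_id in range(1, 4):
    part_name = f"date={part_id}"
    conn.execute("INSERT INTO PARTITIONS VALUES (?, 1, ?)", (part_id, part_name))
    conn.execute(
        "INSERT INTO STATS_WIDE VALUES (?, 'bench_db', 'bench_tbl', ?, 'name', 0)",
        (part_id, part_name),
    )
    conn.execute("INSERT INTO STATS_SLIM VALUES (?, 'name', 0)", (part_id,))

# Before: filter directly on the redundant name columns.
WIDE_QUERY = """
SELECT PARTITION_NAME, COLUMN_NAME, NUM_NULLS
FROM STATS_WIDE
WHERE DB_NAME = ? AND TABLE_NAME = ?
ORDER BY PARTITION_NAME
"""

# After: three joins are needed to filter by database and table name.
SLIM_QUERY = """
SELECT p.PART_NAME, s.COLUMN_NAME, s.NUM_NULLS
FROM STATS_SLIM s
JOIN PARTITIONS p ON s.PART_ID = p.PART_ID
JOIN TBLS t ON p.TBL_ID = t.TBL_ID
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE d.NAME = ? AND t.TBL_NAME = ?
ORDER BY p.PART_NAME
"""

def fetch(query, db, tbl):
    """Run one of the two query shapes for a given database and table name."""
    return conn.execute(query, (db, tbl)).fetchall()

wide_rows = fetch(WIDE_QUERY, "bench_db", "bench_tbl")
slim_rows = fetch(SLIM_QUERY, "bench_db", "bench_tbl")
print(wide_rows == slim_rows)
```

Both shapes return the same rows here. With a single database and table the joined tables stay tiny, which is why the number of databases, tables, and partitions in the benchmark matters for judging the join cost.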
   
   Thanks,
   Zhihua


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

