ngsg commented on code in PR #5674: URL: https://github.com/apache/hive/pull/5674#discussion_r2024816977
########## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/LongColumnStatsAggregator.java: ########## @@ -163,7 +172,13 @@ public ColumnStatisticsObj aggregate(List<ColStatsObjWithSourceInfo> colStatsWit String partName = csp.getPartName(); LongColumnStatsData newData = cso.getStatsData().getLongStats(); if (useDensityFunctionForNDVEstimation) { - densityAvgSum += ((double) (newData.getHighValue() - newData.getLowValue())) / newData.getNumDVs(); + if (newData.getNumDVs() == 0) { Review Comment: I write a test to validate that `newData.getNumDVs() == 0`. I check the test using LlapLocalDriver with an additional modification to `data/conf/llap/hivemetastore-site.xml`, specifically setting `metastore.stats.ndv.densityfunction` to `true`. However, I think is-zero test is unnecessary at this point, as JVM does not throw an error even when the divisor is 0 (cf. p672 of the [Java Language Specification](https://docs.oracle.com/javase/specs/jls/se24/jls24.pdf)). If NDV is 0, this likely indicates that there are no rows in the given partition. In such cases, if high and low values are present(which is a weird situation), the density should be set to infinity. I don’t think we need to address this situation explicitly, as it should not occur under normal circumstances. For documentation purposes, I attach the test I wrote below: ``` create table x_inventory (inv_quantity_on_hand int) partitioned by (inv_date_sk bigint); insert into table x_inventory (inv_quantity_on_hand, inv_date_sk) values (0, 2000), (1, 2001), (2, 2002); set hive.stats.ndv.algo=fm; analyze table x_inventory compute statistics for columns; alter table x_inventory partition (inv_date_sk = 2002) update statistics for column inv_quantity_on_hand set('numDVs'='2','numNulls'='0'); alter table x_inventory partition (inv_date_sk = 2001) update statistics for column inv_quantity_on_hand set('numDVs'='1','numNulls'='0'); alter table x_inventory partition (inv_date_sk = 2000) update statistics for column inv_quantity_on_hand set('numDVs'='0','numNulls'='0'); -- at this point, stats uses FMSketch. set hive.stats.ndv.algo=hll; insert into table x_inventory (inv_quantity_on_hand, inv_date_sk) values (3, 2003); -- so there are FMSketch and HyperLogLog. set hive.stats.column.autogather=false; insert into table x_inventory (inv_quantity_on_hand, inv_date_sk) values (4, 2004); -- Finally, this ensures that the colstats is incomplete and not mergable. desc formatted x_inventory partition (inv_date_sk=2002) inv_quantity_on_hand; desc formatted x_inventory partition (inv_date_sk=2003) inv_quantity_on_hand; desc formatted x_inventory partition (inv_date_sk=2004) inv_quantity_on_hand; -- This requests aggregated column stats of inv_quantity_on_hand. explain select count(distinct inv_quantity_on_hand) from x_inventory; ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org For additional commands, e-mail: gitbox-h...@hive.apache.org