ngsg commented on code in PR #5674:
URL: https://github.com/apache/hive/pull/5674#discussion_r2024816977


##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/LongColumnStatsAggregator.java:
##########
@@ -163,7 +172,13 @@ public ColumnStatisticsObj 
aggregate(List<ColStatsObjWithSourceInfo> colStatsWit
           String partName = csp.getPartName();
           LongColumnStatsData newData = cso.getStatsData().getLongStats();
           if (useDensityFunctionForNDVEstimation) {
-            densityAvgSum += ((double) (newData.getHighValue() - 
newData.getLowValue())) / newData.getNumDVs();
+            if (newData.getNumDVs() == 0) {

Review Comment:
   I write a test to validate that `newData.getNumDVs() == 0`. I check the test 
using LlapLocalDriver with an additional modification to 
`data/conf/llap/hivemetastore-site.xml`, specifically setting 
`metastore.stats.ndv.densityfunction` to `true`.
   
   However, I think is-zero test is unnecessary at this point, as JVM does not 
throw an error even when the divisor is 0 (cf. p672 of the [Java Language 
Specification](https://docs.oracle.com/javase/specs/jls/se24/jls24.pdf)).
   
   If NDV is 0, this likely indicates that there are no rows in the given 
partition. In such cases, if high and low values are present(which is a weird 
situation), the density should be set to infinity. I don’t think we need to 
address this situation explicitly, as it should not occur under normal 
circumstances.
   
   For documentation purposes, I attach the test I wrote below:
   ```
   create table x_inventory (inv_quantity_on_hand int) partitioned by 
(inv_date_sk bigint);
   insert into table x_inventory (inv_quantity_on_hand, inv_date_sk) values (0, 
2000), (1, 2001), (2, 2002);
   
   set hive.stats.ndv.algo=fm;
   analyze table x_inventory compute statistics for columns;
   
   alter table x_inventory partition (inv_date_sk = 2002) update statistics for 
column inv_quantity_on_hand set('numDVs'='2','numNulls'='0');
   alter table x_inventory partition (inv_date_sk = 2001) update statistics for 
column inv_quantity_on_hand set('numDVs'='1','numNulls'='0');
   alter table x_inventory partition (inv_date_sk = 2000) update statistics for 
column inv_quantity_on_hand set('numDVs'='0','numNulls'='0');
   
   -- at this point, stats uses FMSketch.
   
   set hive.stats.ndv.algo=hll;
   insert into table x_inventory (inv_quantity_on_hand, inv_date_sk) values (3, 
2003);
   
   -- so there are FMSketch and HyperLogLog.
   
   set hive.stats.column.autogather=false;
   insert into table x_inventory (inv_quantity_on_hand, inv_date_sk) values (4, 
2004);
   -- Finally, this ensures that the colstats is incomplete and not mergable.
   
   desc formatted x_inventory partition (inv_date_sk=2002) inv_quantity_on_hand;
   desc formatted x_inventory partition (inv_date_sk=2003) inv_quantity_on_hand;
   desc formatted x_inventory partition (inv_date_sk=2004) inv_quantity_on_hand;
   
   -- This requests aggregated column stats of inv_quantity_on_hand.
   explain select count(distinct inv_quantity_on_hand) from x_inventory;
   
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org

Reply via email to