Re: [PR] HIVE-29516 Fix NPE in StatsUtils.updateStats when column statistics a… [hive]

via GitHub Mon, 30 Mar 2026 04:51:11 -0700


zabetak commented on code in PR #6382:
URL: https://github.com/apache/hive/pull/6382#discussion_r3008880342



##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column 
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table 
statistics exist
+-- but column-level statistics are unavailable during semijoin removal 
optimization.
+
+set hive.explain.user=false;
+set hive.tez.dynamic.semijoin.reduction=true;
+set hive.tez.dynamic.semijoin.reduction.threshold=0.1;
+set hive.auto.convert.join=true;
+set hive.auto.convert.join.noconditionaltask=true;
+set hive.auto.convert.join.noconditionaltask.size=10000000000;
+
+-- Disable automatic stats gathering so column stats won't be collected during 
insert
+set hive.stats.autogather=false;
+set hive.stats.column.autogather=false;
+
+-- Create tables
+drop table if exists t1_semijoin_nocolstats;
+drop table if exists t2_semijoin_nocolstats;

Review Comment:
   Cleanup is redundant.



##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column 
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table 
statistics exist
+-- but column-level statistics are unavailable during semijoin removal 
optimization.
+
+set hive.explain.user=false;
+set hive.tez.dynamic.semijoin.reduction=true;
+set hive.tez.dynamic.semijoin.reduction.threshold=0.1;
+set hive.auto.convert.join=true;
+set hive.auto.convert.join.noconditionaltask=true;
+set hive.auto.convert.join.noconditionaltask.size=10000000000;
+
+-- Disable automatic stats gathering so column stats won't be collected during 
insert
+set hive.stats.autogather=false;
+set hive.stats.column.autogather=false;
+
+-- Create tables
+drop table if exists t1_semijoin_nocolstats;
+drop table if exists t2_semijoin_nocolstats;
+
+create table t1_semijoin_nocolstats (id int, val string);
+create table t2_semijoin_nocolstats (id int, val string);
+
+-- Insert some data (column stats won't be collected due to autogather=false)
+insert into t1_semijoin_nocolstats values (1, 'a'), (2, 'b'), (3, 'c'), (4, 
'd'), (5, 'e');
+insert into t2_semijoin_nocolstats values (1, 'x'), (2, 'y'), (6, 'z');
+
+-- Set only basic table statistics (numRows, rawDataSize) WITHOUT column 
statistics
+-- This simulates the scenario where column stats are unavailable (e.g., large 
TPC-DS datasets)
+alter table t1_semijoin_nocolstats update statistics 
set('numRows'='100000000', 'rawDataSize'='2000000000');
+alter table t2_semijoin_nocolstats update statistics set('numRows'='1000', 
'rawDataSize'='20000');
+
+-- This join query should trigger semijoin optimization evaluation
+-- Previously this would cause NPE in StatsUtils.updateStats when column stats 
were null
+-- With the fix, it should execute successfully by skipping semijoin 
optimization when column stats are unavailable
+explain
+select t1.id, t1.val, t2.val
+from t1_semijoin_nocolstats t1 join t2_semijoin_nocolstats t2 on t1.id = t2.id;
+
+select t1.id, t1.val, t2.val
+from t1_semijoin_nocolstats t1 join t2_semijoin_nocolstats t2 on t1.id = t2.id
+order by t1.id;
+
+-- Cleanup
+drop table t1_semijoin_nocolstats;
+drop table t2_semijoin_nocolstats;

Review Comment:
   Cleanup is taken care from the test framework so you can remove these lines.



##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column 
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table 
statistics exist
+-- but column-level statistics are unavailable during semijoin removal 
optimization.
+
+set hive.explain.user=false;
+set hive.tez.dynamic.semijoin.reduction=true;
+set hive.tez.dynamic.semijoin.reduction.threshold=0.1;
+set hive.auto.convert.join=true;
+set hive.auto.convert.join.noconditionaltask=true;
+set hive.auto.convert.join.noconditionaltask.size=10000000000;
+
+-- Disable automatic stats gathering so column stats won't be collected during 
insert
+set hive.stats.autogather=false;
+set hive.stats.column.autogather=false;
+
+-- Create tables
+drop table if exists t1_semijoin_nocolstats;
+drop table if exists t2_semijoin_nocolstats;
+
+create table t1_semijoin_nocolstats (id int, val string);
+create table t2_semijoin_nocolstats (id int, val string);
+
+-- Insert some data (column stats won't be collected due to autogather=false)
+insert into t1_semijoin_nocolstats values (1, 'a'), (2, 'b'), (3, 'c'), (4, 
'd'), (5, 'e');
+insert into t2_semijoin_nocolstats values (1, 'x'), (2, 'y'), (6, 'z');

Review Comment:
   Since the problem appears in the compilation there is no need to `INSERT` 
data and run the query.



##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column 
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table 
statistics exist
+-- but column-level statistics are unavailable during semijoin removal 
optimization.
+
+set hive.explain.user=false;
+set hive.tez.dynamic.semijoin.reduction=true;
+set hive.tez.dynamic.semijoin.reduction.threshold=0.1;
+set hive.auto.convert.join=true;
+set hive.auto.convert.join.noconditionaltask=true;
+set hive.auto.convert.join.noconditionaltask.size=10000000000;
+
+-- Disable automatic stats gathering so column stats won't be collected during 
insert
+set hive.stats.autogather=false;
+set hive.stats.column.autogather=false;

Review Comment:
   If we don't run the INSERT we don't need these settings as well.



##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column 
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table 
statistics exist
+-- but column-level statistics are unavailable during semijoin removal 
optimization.
+
+set hive.explain.user=false;

Review Comment:
   `hive.explain.user` is set in `data/conf/llap/hive-site.xml` so redundant 
here.



##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column 
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table 
statistics exist
+-- but column-level statistics are unavailable during semijoin removal 
optimization.
+
+set hive.explain.user=false;
+set hive.tez.dynamic.semijoin.reduction=true;
+set hive.tez.dynamic.semijoin.reduction.threshold=0.1;
+set hive.auto.convert.join=true;
+set hive.auto.convert.join.noconditionaltask=true;
+set hive.auto.convert.join.noconditionaltask.size=10000000000;
+
+-- Disable automatic stats gathering so column stats won't be collected during 
insert
+set hive.stats.autogather=false;
+set hive.stats.column.autogather=false;
+
+-- Create tables
+drop table if exists t1_semijoin_nocolstats;
+drop table if exists t2_semijoin_nocolstats;
+
+create table t1_semijoin_nocolstats (id int, val string);
+create table t2_semijoin_nocolstats (id int, val string);
+
+-- Insert some data (column stats won't be collected due to autogather=false)
+insert into t1_semijoin_nocolstats values (1, 'a'), (2, 'b'), (3, 'c'), (4, 
'd'), (5, 'e');
+insert into t2_semijoin_nocolstats values (1, 'x'), (2, 'y'), (6, 'z');
+
+-- Set only basic table statistics (numRows, rawDataSize) WITHOUT column 
statistics
+-- This simulates the scenario where column stats are unavailable (e.g., large 
TPC-DS datasets)
+alter table t1_semijoin_nocolstats update statistics 
set('numRows'='100000000', 'rawDataSize'='2000000000');
+alter table t2_semijoin_nocolstats update statistics set('numRows'='1000', 
'rawDataSize'='20000');
+
+-- This join query should trigger semijoin optimization evaluation
+-- Previously this would cause NPE in StatsUtils.updateStats when column stats 
were null
+-- With the fix, it should execute successfully by skipping semijoin 
optimization when column stats are unavailable
+explain
+select t1.id, t1.val, t2.val
+from t1_semijoin_nocolstats t1 join t2_semijoin_nocolstats t2 on t1.id = t2.id;
+
+select t1.id, t1.val, t2.val
+from t1_semijoin_nocolstats t1 join t2_semijoin_nocolstats t2 on t1.id = t2.id
+order by t1.id;

Review Comment:
   Execution is not necessary so we can skip this part.



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1929,7 +1929,9 @@ private void 
removeSemijoinOptimizationByBenefit(OptimizeTezProcContext procCtx)
           semijoinRsToRemove.add(rs);
         } else {
           // This semijoin qualifies, add it to the result set
-          if (filterStats != null) {
+          // Only proceed if filterStats has column statistics available, as 
updateStats
+          // will be called with useColStats=true later
+          if (filterStats != null && filterStats.getColumnStats() != null) {

Review Comment:
   If the problem only affects the updateStats method then why not adapting 
directly the code there and pass false whenever we don't have column stats?



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java:
##########
@@ -2626,6 +2626,9 @@ private void updateColStats(HiveConf conf, Statistics 
stats, long leftUnmatchedR
       // stats for columns from 1st parent should be scaled down by 200/10 = 
20x
       // and stats for columns from 2nd parent should be scaled down by 200x
       List<ColStatistics> colStats = stats.getColumnStats();
+      if (colStats == null) {
+        colStats = Collections.emptyList();
+      }

Review Comment:
   Based on the call hierarchy this is never null at the moment. We should 
either leave this part unchanged or throw an `IllegalArgumentException` since 
the whole method is meant to handle column statistics so passing a null does 
not make much sense. 



##########
ql/src/test/org/apache/hadoop/hive/ql/plan/TestStatistics.java:
##########
@@ -153,4 +155,25 @@ public void testScaleToRowCountMultipleColumns() {
     assertEquals(-1, scaledCol3.getNumTrues());
     assertEquals(-1, scaledCol3.getNumFalses());
   }
+
+  @Test
+  public void testGetColumnStatsReturnsNullWhenNoColumnStats() {
+    Statistics stats = new Statistics(100, 1000, 0, 0);
+
+    assertNull(stats.getColumnStats());
+  }
+
+  @Test
+  public void testGetColumnStatsReturnsListWhenAvailable() {
+    Statistics stats = new Statistics(100, 1000, 0, 0);
+    ColStatistics colStats = new ColStatistics("col1", "int");
+    colStats.setCountDistint(100);
+    colStats.setNumNulls(0);
+    stats.setColumnStats(Arrays.asList(colStats));
+
+    List<ColStatistics> result = stats.getColumnStats();

Review Comment:
   The test is a trivial input/output validation and does not add much value. 
Since no changes are performed in `ColStatistics` class the tests here appear 
redundant and unecessary.



##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column 
statistics gracefully

Review Comment:
   I run this test using `mvn test -Dtest=TestMiniLlapLocalCliDriver 
-Dqfile=semijoin_stats_missing_colstats.q -Dtest.output.overwrite` in master 
but it does not fail. In addition it seems that the respective .q.out file is 
missing from the PR so it seems that the test and PR is incomplete.



##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2030,34 +2034,41 @@ public static void updateStats(Statistics stats, long 
newNumRows,
 
     if (useColStats) {
       List<ColStatistics> colStats = stats.getColumnStats();
-      for (ColStatistics cs : colStats) {
-        long oldDV = cs.getCountDistint();
-        if (affectedColumns.contains(cs.getColumnName())) {
-          long newDV = oldDV;
-
-          // if ratio is greater than 1, then number of rows increases. This 
can happen
-          // when some operators like GROUPBY duplicates the input rows in 
which case
-          // number of distincts should not change. Update the distinct count 
only when
-          // the output number of rows is less than input number of rows.
-          if (ratio <= 1.0) {
-            newDV = (long) Math.ceil(ratio * oldDV);
+      if (colStats != null && !colStats.isEmpty()) {

Review Comment:
   There is an unwritten assumption that when `useColStats` is `true` then 
column stats are present. It feels like a programming error to call this method 
with true when stats are missing. I would be more inclined to throw an 
`IllegalArgumentException` instead of ignoring the fact that stats are missing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29516 Fix NPE in StatsUtils.updateStats when column statistics a… [hive]

Reply via email to