zabetak commented on code in PR #6382:
URL: https://github.com/apache/hive/pull/6382#discussion_r3008880342
##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table
statistics exist
+-- but column-level statistics are unavailable during semijoin removal
optimization.
+
+set hive.explain.user=false;
+set hive.tez.dynamic.semijoin.reduction=true;
+set hive.tez.dynamic.semijoin.reduction.threshold=0.1;
+set hive.auto.convert.join=true;
+set hive.auto.convert.join.noconditionaltask=true;
+set hive.auto.convert.join.noconditionaltask.size=10000000000;
+
+-- Disable automatic stats gathering so column stats won't be collected during
insert
+set hive.stats.autogather=false;
+set hive.stats.column.autogather=false;
+
+-- Create tables
+drop table if exists t1_semijoin_nocolstats;
+drop table if exists t2_semijoin_nocolstats;
Review Comment:
Cleanup is redundant.
##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table
statistics exist
+-- but column-level statistics are unavailable during semijoin removal
optimization.
+
+set hive.explain.user=false;
+set hive.tez.dynamic.semijoin.reduction=true;
+set hive.tez.dynamic.semijoin.reduction.threshold=0.1;
+set hive.auto.convert.join=true;
+set hive.auto.convert.join.noconditionaltask=true;
+set hive.auto.convert.join.noconditionaltask.size=10000000000;
+
+-- Disable automatic stats gathering so column stats won't be collected during
insert
+set hive.stats.autogather=false;
+set hive.stats.column.autogather=false;
+
+-- Create tables
+drop table if exists t1_semijoin_nocolstats;
+drop table if exists t2_semijoin_nocolstats;
+
+create table t1_semijoin_nocolstats (id int, val string);
+create table t2_semijoin_nocolstats (id int, val string);
+
+-- Insert some data (column stats won't be collected due to autogather=false)
+insert into t1_semijoin_nocolstats values (1, 'a'), (2, 'b'), (3, 'c'), (4,
'd'), (5, 'e');
+insert into t2_semijoin_nocolstats values (1, 'x'), (2, 'y'), (6, 'z');
+
+-- Set only basic table statistics (numRows, rawDataSize) WITHOUT column
statistics
+-- This simulates the scenario where column stats are unavailable (e.g., large
TPC-DS datasets)
+alter table t1_semijoin_nocolstats update statistics
set('numRows'='100000000', 'rawDataSize'='2000000000');
+alter table t2_semijoin_nocolstats update statistics set('numRows'='1000',
'rawDataSize'='20000');
+
+-- This join query should trigger semijoin optimization evaluation
+-- Previously this would cause NPE in StatsUtils.updateStats when column stats
were null
+-- With the fix, it should execute successfully by skipping semijoin
optimization when column stats are unavailable
+explain
+select t1.id, t1.val, t2.val
+from t1_semijoin_nocolstats t1 join t2_semijoin_nocolstats t2 on t1.id = t2.id;
+
+select t1.id, t1.val, t2.val
+from t1_semijoin_nocolstats t1 join t2_semijoin_nocolstats t2 on t1.id = t2.id
+order by t1.id;
+
+-- Cleanup
+drop table t1_semijoin_nocolstats;
+drop table t2_semijoin_nocolstats;
Review Comment:
Cleanup is taken care from the test framework so you can remove these lines.
##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table
statistics exist
+-- but column-level statistics are unavailable during semijoin removal
optimization.
+
+set hive.explain.user=false;
+set hive.tez.dynamic.semijoin.reduction=true;
+set hive.tez.dynamic.semijoin.reduction.threshold=0.1;
+set hive.auto.convert.join=true;
+set hive.auto.convert.join.noconditionaltask=true;
+set hive.auto.convert.join.noconditionaltask.size=10000000000;
+
+-- Disable automatic stats gathering so column stats won't be collected during
insert
+set hive.stats.autogather=false;
+set hive.stats.column.autogather=false;
+
+-- Create tables
+drop table if exists t1_semijoin_nocolstats;
+drop table if exists t2_semijoin_nocolstats;
+
+create table t1_semijoin_nocolstats (id int, val string);
+create table t2_semijoin_nocolstats (id int, val string);
+
+-- Insert some data (column stats won't be collected due to autogather=false)
+insert into t1_semijoin_nocolstats values (1, 'a'), (2, 'b'), (3, 'c'), (4,
'd'), (5, 'e');
+insert into t2_semijoin_nocolstats values (1, 'x'), (2, 'y'), (6, 'z');
Review Comment:
Since the problem appears in the compilation there is no need to `INSERT`
data and run the query.
##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table
statistics exist
+-- but column-level statistics are unavailable during semijoin removal
optimization.
+
+set hive.explain.user=false;
+set hive.tez.dynamic.semijoin.reduction=true;
+set hive.tez.dynamic.semijoin.reduction.threshold=0.1;
+set hive.auto.convert.join=true;
+set hive.auto.convert.join.noconditionaltask=true;
+set hive.auto.convert.join.noconditionaltask.size=10000000000;
+
+-- Disable automatic stats gathering so column stats won't be collected during
insert
+set hive.stats.autogather=false;
+set hive.stats.column.autogather=false;
Review Comment:
If we don't run the INSERT we don't need these settings as well.
##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table
statistics exist
+-- but column-level statistics are unavailable during semijoin removal
optimization.
+
+set hive.explain.user=false;
Review Comment:
`hive.explain.user` is set in `data/conf/llap/hive-site.xml` so redundant
here.
##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column
statistics gracefully
+-- This test verifies that queries don't fail with NPE when basic table
statistics exist
+-- but column-level statistics are unavailable during semijoin removal
optimization.
+
+set hive.explain.user=false;
+set hive.tez.dynamic.semijoin.reduction=true;
+set hive.tez.dynamic.semijoin.reduction.threshold=0.1;
+set hive.auto.convert.join=true;
+set hive.auto.convert.join.noconditionaltask=true;
+set hive.auto.convert.join.noconditionaltask.size=10000000000;
+
+-- Disable automatic stats gathering so column stats won't be collected during
insert
+set hive.stats.autogather=false;
+set hive.stats.column.autogather=false;
+
+-- Create tables
+drop table if exists t1_semijoin_nocolstats;
+drop table if exists t2_semijoin_nocolstats;
+
+create table t1_semijoin_nocolstats (id int, val string);
+create table t2_semijoin_nocolstats (id int, val string);
+
+-- Insert some data (column stats won't be collected due to autogather=false)
+insert into t1_semijoin_nocolstats values (1, 'a'), (2, 'b'), (3, 'c'), (4,
'd'), (5, 'e');
+insert into t2_semijoin_nocolstats values (1, 'x'), (2, 'y'), (6, 'z');
+
+-- Set only basic table statistics (numRows, rawDataSize) WITHOUT column
statistics
+-- This simulates the scenario where column stats are unavailable (e.g., large
TPC-DS datasets)
+alter table t1_semijoin_nocolstats update statistics
set('numRows'='100000000', 'rawDataSize'='2000000000');
+alter table t2_semijoin_nocolstats update statistics set('numRows'='1000',
'rawDataSize'='20000');
+
+-- This join query should trigger semijoin optimization evaluation
+-- Previously this would cause NPE in StatsUtils.updateStats when column stats
were null
+-- With the fix, it should execute successfully by skipping semijoin
optimization when column stats are unavailable
+explain
+select t1.id, t1.val, t2.val
+from t1_semijoin_nocolstats t1 join t2_semijoin_nocolstats t2 on t1.id = t2.id;
+
+select t1.id, t1.val, t2.val
+from t1_semijoin_nocolstats t1 join t2_semijoin_nocolstats t2 on t1.id = t2.id
+order by t1.id;
Review Comment:
Execution is not necessary so we can skip this part.
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1929,7 +1929,9 @@ private void
removeSemijoinOptimizationByBenefit(OptimizeTezProcContext procCtx)
semijoinRsToRemove.add(rs);
} else {
// This semijoin qualifies, add it to the result set
- if (filterStats != null) {
+ // Only proceed if filterStats has column statistics available, as
updateStats
+ // will be called with useColStats=true later
+ if (filterStats != null && filterStats.getColumnStats() != null) {
Review Comment:
If the problem only affects the updateStats method then why not adapting
directly the code there and pass false whenever we don't have column stats?
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java:
##########
@@ -2626,6 +2626,9 @@ private void updateColStats(HiveConf conf, Statistics
stats, long leftUnmatchedR
// stats for columns from 1st parent should be scaled down by 200/10 =
20x
// and stats for columns from 2nd parent should be scaled down by 200x
List<ColStatistics> colStats = stats.getColumnStats();
+ if (colStats == null) {
+ colStats = Collections.emptyList();
+ }
Review Comment:
Based on the call hierarchy this is never null at the moment. We should
either leave this part unchanged or throw an `IllegalArgumentException` since
the whole method is meant to handle column statistics so passing a null does
not make much sense.
##########
ql/src/test/org/apache/hadoop/hive/ql/plan/TestStatistics.java:
##########
@@ -153,4 +155,25 @@ public void testScaleToRowCountMultipleColumns() {
assertEquals(-1, scaledCol3.getNumTrues());
assertEquals(-1, scaledCol3.getNumFalses());
}
+
+ @Test
+ public void testGetColumnStatsReturnsNullWhenNoColumnStats() {
+ Statistics stats = new Statistics(100, 1000, 0, 0);
+
+ assertNull(stats.getColumnStats());
+ }
+
+ @Test
+ public void testGetColumnStatsReturnsListWhenAvailable() {
+ Statistics stats = new Statistics(100, 1000, 0, 0);
+ ColStatistics colStats = new ColStatistics("col1", "int");
+ colStats.setCountDistint(100);
+ colStats.setNumNulls(0);
+ stats.setColumnStats(Arrays.asList(colStats));
+
+ List<ColStatistics> result = stats.getColumnStats();
Review Comment:
The test is a trivial input/output validation and does not add much value.
Since no changes are performed in `ColStatistics` class the tests here appear
redundant and unecessary.
##########
ql/src/test/queries/clientpositive/semijoin_stats_missing_colstats.q:
##########
@@ -0,0 +1,45 @@
+-- HIVE-29516: Test that semijoin optimization handles missing column
statistics gracefully
Review Comment:
I run this test using `mvn test -Dtest=TestMiniLlapLocalCliDriver
-Dqfile=semijoin_stats_missing_colstats.q -Dtest.output.overwrite` in master
but it does not fail. In addition it seems that the respective .q.out file is
missing from the PR so it seems that the test and PR is incomplete.
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2030,34 +2034,41 @@ public static void updateStats(Statistics stats, long
newNumRows,
if (useColStats) {
List<ColStatistics> colStats = stats.getColumnStats();
- for (ColStatistics cs : colStats) {
- long oldDV = cs.getCountDistint();
- if (affectedColumns.contains(cs.getColumnName())) {
- long newDV = oldDV;
-
- // if ratio is greater than 1, then number of rows increases. This
can happen
- // when some operators like GROUPBY duplicates the input rows in
which case
- // number of distincts should not change. Update the distinct count
only when
- // the output number of rows is less than input number of rows.
- if (ratio <= 1.0) {
- newDV = (long) Math.ceil(ratio * oldDV);
+ if (colStats != null && !colStats.isEmpty()) {
Review Comment:
There is an unwritten assumption that when `useColStats` is `true` then
column stats are present. It feels like a programming error to call this method
with true when stats are missing. I would be more inclined to throw an
`IllegalArgumentException` instead of ignoring the fact that stats are missing.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]