Shubham Sharma created HIVE-29515:
-------------------------------------
Summary: NullPointerException in StatsUtils.updateStats when
column statistics are unavailable during semijoin optimization
Key: HIVE-29515
URL: https://issues.apache.org/jira/browse/HIVE-29515
Project: Hive
Issue Type: Bug
Components: Query Processor, Statistics
Affects Versions: 4.2.0
Reporter: Shubham Sharma
Fix For: 4.3.0
h3. Problem
Query compilation fails with {{NullPointerException}} in
{{StatsUtils.updateStats()}} when column statistics are not available for
certain operators. This occurs during the semijoin optimization phase in
{{TezCompiler.removeSemijoinOptimizationByBenefit()}}.
The issue is reproducible with TPC-DS queries at scale factors of 100GB or
higher, where column-level statistics may be incomplete or unavailable for some
tables.
{code:java}
java.lang.NullPointerException
    at org.apache.hadoop.hive.ql.stats.StatsUtils.updateStats(StatsUtils.java:2067)
    at org.apache.hadoop.hive.ql.parse.TezCompiler.removeSemijoinOptimizationByBenefit(TezCompiler.java:1982)
    at org.apache.hadoop.hive.ql.parse.TezCompiler.semijoinRemovalBasedTransformations(TezCompiler.java:539)
    at org.apache.hadoop.hive.ql.parse.TezCompiler.optimizeOperatorPlan(TezCompiler.java:238)
    at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:174)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.compilePlan(SemanticAnalyzer.java:12521)
    at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12739)
    at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:460)
    ...
{code}
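The report does not say which reference is null at {{StatsUtils.java:2067}}. As an illustration only, the sketch below assumes the failure comes from dereferencing a missing column-statistics list (or a missing entry in it) and shows the kind of null guard that would let compilation continue without column statistics. {{NullSafeStatsSketch}}, {{ColStat}} and {{capNdv}} are hypothetical stand-ins, not Hive classes or methods, and this is not the proposed patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in Hive.
final class NullSafeStatsSketch {

    // Stand-in for a per-column statistics record.
    static final class ColStat {
        final String name;
        long ndv; // number of distinct values

        ColStat(String name, long ndv) {
            this.name = name;
            this.ndv = ndv;
        }
    }

    /**
     * Caps each column's distinct-value count at newNumRows and returns
     * how many columns were changed. A null list or a null entry is
     * treated as "column statistics unavailable" and skipped instead of
     * being dereferenced (the assumed cause of the NPE).
     */
    static int capNdv(List<ColStat> colStats, long newNumRows) {
        if (colStats == null) {
            return 0; // no column statistics at all: nothing to update
        }
        int updated = 0;
        for (ColStat cs : colStats) {
            if (cs == null) {
                continue; // statistics missing for this column
            }
            if (cs.ndv > newNumRows) {
                cs.ndv = newNumRows;
                updated++;
            }
        }
        return updated;
    }

    public static void main(String[] args) {
        List<ColStat> stats = new ArrayList<>();
        stats.add(new ColStat("a", 500));
        stats.add(null); // simulates a column without statistics
        stats.add(new ColStat("b", 10));

        System.out.println(capNdv(stats, 100)); // columns updated
        System.out.println(capNdv(null, 100));  // no statistics available
    }
}
```

Skipping the update when statistics are missing keeps whatever estimate the operator already carries. Whether that is the right fallback for the semijoin benefit calculation is a separate design question that this sketch does not answer.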
h3. How to Reproduce
# Generate a TPC-DS dataset at 100GB scale or larger.
# Ensure column statistics are not fully computed for all tables.
# Run TPC-DS queries that involve semijoin optimizations (queries with subqueries or complex joins, e.g. queries 10, 17, 19, 23, 24, 25, 29 and 32).
# Observe the NPE during query compilation.