[
https://issues.apache.org/jira/browse/HIVE-29516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HIVE-29516:
----------------------------------
Labels: pull-request-available (was: )
> NullPointerException in StatsUtils.updateStats when column statistics are
> unavailable during semijoin optimization
> ------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-29516
> URL: https://issues.apache.org/jira/browse/HIVE-29516
> Project: Hive
> Issue Type: Bug
> Components: Query Processor, Statistics
> Affects Versions: 4.2.0
> Reporter: Shubham Sharma
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.3.0
>
>
> h3. Problem
> Query compilation fails with a {{NullPointerException}} in
> {{StatsUtils.updateStats()}} when column statistics are not available for
> certain operators. This occurs during the semijoin optimization phase in
> {{TezCompiler.removeSemijoinOptimizationByBenefit()}}.
> The issue is reproducible with TPC-DS queries at scale factors of 100GB or
> higher, where column-level statistics may be incomplete or unavailable for
> some tables.
>
>
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.hive.ql.stats.StatsUtils.updateStats(StatsUtils.java:2067)
>   at org.apache.hadoop.hive.ql.parse.TezCompiler.removeSemijoinOptimizationByBenefit(TezCompiler.java:1982)
>   at org.apache.hadoop.hive.ql.parse.TezCompiler.semijoinRemovalBasedTransformations(TezCompiler.java:539)
>   at org.apache.hadoop.hive.ql.parse.TezCompiler.optimizeOperatorPlan(TezCompiler.java:238)
>   at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:174)
>   at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.compilePlan(SemanticAnalyzer.java:12521)
>   at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12739)
>   at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:460)
>   ...
> {code}
> h3. How to Reproduce
> # Generate a TPC-DS dataset at 100GB scale or larger.
> # Ensure column statistics are not fully computed for all tables.
> # Run TPC-DS queries that involve semijoin optimizations (queries with
> subqueries or complex joins, e.g. 10, 17, 19, 23, 24, 25, 29, 32).
> # Observe the NPE during query compilation.
>
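For illustration only, the failure mode described above (a stats update that dereferences column statistics that may be absent) can be sketched with a null guard. This is a minimal, self-contained sketch and not Hive's code: the classes ColStat and Stats and the method updateStatsNullSafe are hypothetical stand-ins, and the real signature of StatsUtils.updateStats() and the fix in the linked pull request may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Hive's classes.
final class NullSafeStatsSketch {

    /** Simplified per-column statistics (assumed shape). */
    static final class ColStat {
        final String name;
        long distinctValues;

        ColStat(String name, long distinctValues) {
            this.name = name;
            this.distinctValues = distinctValues;
        }
    }

    /** Simplified operator statistics; colStats is null when unavailable. */
    static final class Stats {
        long numRows;
        List<ColStat> colStats;

        Stats(long numRows, List<ColStat> colStats) {
            this.numRows = numRows;
            this.colStats = colStats;
        }
    }

    /**
     * Sets the new row count and, if column statistics are present, caps each
     * column's distinct-value count at that row count.
     *
     * @return true if column statistics were adjusted, false if they were absent
     */
    static boolean updateStatsNullSafe(Stats stats, long newNumRows) {
        stats.numRows = newNumRows;
        if (stats.colStats == null) {
            // Without this guard, the loop below would throw a NullPointerException.
            return false;
        }
        for (ColStat cs : stats.colStats) {
            if (cs != null) {
                cs.distinctValues = Math.min(cs.distinctValues, newNumRows);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Case 1: column statistics unavailable.
        Stats missing = new Stats(1_000, null);
        System.out.println("adjusted=" + updateStatsNullSafe(missing, 100));

        // Case 2: column statistics available.
        List<ColStat> cols = new ArrayList<>();
        cols.add(new ColStat("example_col", 500));
        Stats present = new Stats(1_000, cols);
        System.out.println("adjusted=" + updateStatsNullSafe(present, 100)
            + " ndv=" + cols.get(0).distinctValues);
    }
}
```

In this sketch the row count is still updated when column statistics are missing, and the caller can tell from the return value that only row-level information was applied. Whether the semijoin benefit estimation should fall back to row counts or skip the candidate in that case is a design choice for the actual fix.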