shubhluck commented on code in PR #6382:
URL: https://github.com/apache/hive/pull/6382#discussion_r3006807875


##########
ql/src/java/org/apache/hadoop/hive/ql/plan/Statistics.java:
##########
@@ -328,6 +329,19 @@ public List<ColStatistics> getColumnStats() {
     return null;
   }
 
+  /**
+   * Returns the column statistics as a list, or an empty list if column statistics are unavailable.
+   * This method is useful to avoid null checks when iterating over column statistics.
+   *
+   * @return list of column statistics, or empty list if unavailable
+   */
+  public List<ColStatistics> getColumnStatsOrEmpty() {
+    if (columnStats != null) {
+      return Lists.newArrayList(columnStats.values());
+    }
+    return Collections.emptyList();
+  }
+

Review Comment:
   Thanks for the feedback! I agree that adding a new method increases API complexity. I've updated the PR to implement Option 2 and Option 3 together:
   
   **Option 3 (Root cause fix):** Added a precondition check in removeSemijoinOptimizationByBenefit():
   ```
   if (filterStats != null && filterStats.getColumnStats() != null) {
   ```
   This prevents the semijoin optimization from proceeding when column statistics are unavailable. Proceeding without them was the root cause of the NPE in the original TPC-DS workload.
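
   For illustration, here is a minimal, self-contained sketch of that guard. It is not the actual Hive code: `StatsSketch` and `SemijoinGuardSketch.hasColumnStats()` are hypothetical stand-ins, and the only behaviour assumed from the diff is that getColumnStats() returns null when no column statistics exist.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;

   // Hypothetical stand-in for Statistics; it models only the fact that
   // getColumnStats() returns null when no column statistics were collected.
   final class StatsSketch {
     private final Map<String, Long> columnStats; // column name -> simplified stat value

     StatsSketch(Map<String, Long> columnStats) {
       this.columnStats = columnStats;
     }

     List<Long> getColumnStats() {
       if (columnStats != null) {
         return new ArrayList<>(columnStats.values());
       }
       return null;
     }
   }

   final class SemijoinGuardSketch {
     // Hypothetical helper mirroring the precondition in the comment:
     // true only when the stats object and its column stats are both present.
     static boolean hasColumnStats(StatsSketch filterStats) {
       return filterStats != null && filterStats.getColumnStats() != null;
     }
   }
   ```

   A caller would evaluate the column-stats-dependent logic only when `hasColumnStats(filterStats)` is true, so neither a missing stats object nor a null column-stats list is ever dereferenced.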
   
   **Option 2 (Defensive null checks):** Added null checks in:
   - StatsUtils.updateStats() - with a LOG.warn when stats are unavailable
   - StatsUtils.getColStatisticsUpdatingTableAlias()
   - StatsRulesProcFactory.updateColStats()
   - SemanticAnalyzer.getMaterializedTableStats()
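
   As a rough sketch of the defensive pattern used at these call sites (not the code from the PR), the example below warns and returns early when the column-stats list is null. `NullSafeStatsSketch`, `scaleColumnStats()`, and the scaling logic are hypothetical, and java.util.logging stands in for Hive's logger.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.logging.Logger;

   final class NullSafeStatsSketch {
     private static final Logger LOG =
         Logger.getLogger(NullSafeStatsSketch.class.getName());

     // Hypothetical helper: scales each per-column value by a ratio.
     // A null list models getColumnStats() returning null.
     static List<Long> scaleColumnStats(List<Long> colStats, double ratio) {
       if (colStats == null) {
         // Warn and return early instead of iterating over a null list.
         LOG.warning("Column statistics unavailable; skipping column stats update");
         return new ArrayList<>();
       }
       List<Long> scaled = new ArrayList<>(colStats.size());
       for (long value : colStats) {
         // Keep every scaled value at 1 or above.
         scaled.add(Math.max(1L, Math.round(value * ratio)));
       }
       return scaled;
     }
   }
   ```

   The null check sits at the top of the method, so the loop body needs no further null handling for the list itself.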
   
   **Removed:** the getColumnStatsOrEmpty() method from Statistics.java
   
   **Added .q test:** semijoin_stats_missing_colstats.q, a regression test that verifies queries execute successfully when basic table stats exist but column stats are unavailable. Note: reproducing the exact NPE requires the original TPC-DS workload, where semijoin optimization is actively triggered.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

