IF columns

Konstantin Bereznyakov (Jira) Thu, 11 Dec 2025 16:09:07 -0800

Konstantin Bereznyakov created HIVE-29368:
---------------------------------------------


             Summary: NDV is underestimated for temp tables having CASE/WHEN/IF 
columns
                 Key: HIVE-29368
                 URL: https://issues.apache.org/jira/browse/HIVE-29368
             Project: Hive
          Issue Type: Bug
         Environment: the test file [^mapjoin_case_when_ndv_bug.q] contaisn an 
artificially low max limit for mapjoin conversion. Yet 
[^mapjoin_case_when_ndv_bug.q.out] shows that the conversion still happened, 
the reason being that the "SMALL" table is estimated with 1 record while its 
accurate estimation is 20 records. The query plan shows a MAPJOIN, which, on a 
larger scale, could lead to query failures due to OOMs
            Reporter: Konstantin Bereznyakov
         Attachments: mapjoin_case_when_ndv_bug.q, 
mapjoin_case_when_ndv_bug.q.out

If an intermediate table is created in a subquery using a CASE..WHEN clause 
with constants, like:

{{CASE WHEN category = 1 THEN 'A'}}
{{       WHEN category = 2 THEN 'B'}}
{{       ELSE 'C'}}
{{END}}

each value is correctly considered to have an NDV of 1. 
Later down the road, due to using PessimisticStatCombiner for combining these 
stats to a single stats object for the new column, the value remains 1 for the 
whole column:
[https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java#L152]
[https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java#L45]

When a GROUP BY clause uses this column, the Optimizer correctly estimates the 
maximum possible number of rows in the result as 1, which could lead to overly 
optimistic subsequent steps, including a mapjoin conversion.

This could also lead to false "0" estimations of NOT expressions here: 
[https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L909]
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (HIVE-29368) NDV is underestimated for temp tables having CASE/WHEN/IF columns

Reply via email to