Konstantin Bereznyakov created HIVE-29368:
---------------------------------------------
Summary: NDV is underestimated for temp tables having CASE/WHEN/IF
columns
Key: HIVE-29368
URL: https://issues.apache.org/jira/browse/HIVE-29368
Project: Hive
Issue Type: Bug
Environment: the test file [^mapjoin_case_when_ndv_bug.q] contaisn an
artificially low max limit for mapjoin conversion. Yet
[^mapjoin_case_when_ndv_bug.q.out] shows that the conversion still happened,
the reason being that the "SMALL" table is estimated with 1 record while its
accurate estimation is 20 records. The query plan shows a MAPJOIN, which, on a
larger scale, could lead to query failures due to OOMs
Reporter: Konstantin Bereznyakov
Attachments: mapjoin_case_when_ndv_bug.q,
mapjoin_case_when_ndv_bug.q.out
If an intermediate table is created in a subquery using a CASE..WHEN clause
with constants, like:
{{CASE WHEN category = 1 THEN 'A'}}
{{ WHEN category = 2 THEN 'B'}}
{{ ELSE 'C'}}
{{END}}
each value is correctly considered to have an NDV of 1.
Later down the road, due to using PessimisticStatCombiner for combining these
stats to a single stats object for the new column, the value remains 1 for the
whole column:
[https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java#L152]
[https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java#L45]
When a GROUP BY clause uses this column, the Optimizer correctly estimates the
maximum possible number of rows in the result as 1, which could lead to overly
optimistic subsequent steps, including a mapjoin conversion.
This could also lead to false "0" estimations of NOT expressions here:
[https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L909]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)