[
https://issues.apache.org/jira/browse/HIVE-29368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on HIVE-29368 started by Konstantin Bereznyakov.
-----------------------------------------------------
> NDV is underestimated for temp tables having CASE/WHEN/IF columns
> -----------------------------------------------------------------
>
> Key: HIVE-29368
> URL: https://issues.apache.org/jira/browse/HIVE-29368
> Project: Hive
> Issue Type: Bug
> Environment: the test file [^mapjoin_case_when_ndv_bug.q] contaisn an
> artificially low max limit for mapjoin conversion. Yet
> [^mapjoin_case_when_ndv_bug.q.out] shows that the conversion still happened,
> the reason being that the "SMALL" table is estimated with 1 record while its
> accurate estimation is 20 records. The query plan shows a MAPJOIN, which, on
> a larger scale, could lead to query failures due to OOMs
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
> Attachments: mapjoin_case_when_ndv_bug.q,
> mapjoin_case_when_ndv_bug.q.out
>
>
> If an intermediate table is created in a subquery using a CASE..WHEN clause
> with constants, like:
> {{CASE WHEN category = 1 THEN 'A'}}
> {{ WHEN category = 2 THEN 'B'}}
> {{ ELSE 'C'}}
> {{END}}
> each value is correctly considered to have an NDV of 1.
> Later down the road, due to using PessimisticStatCombiner for combining these
> stats to a single stats object for the new column, the value remains 1 for
> the whole column:
> [https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java#L152]
> [https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java#L45]
> When a GROUP BY clause uses this column, the Optimizer correctly estimates
> the maximum possible number of rows in the result as 1, which could lead to
> overly optimistic subsequent steps, including a mapjoin conversion.
> This could also lead to false "0" estimations of NOT expressions here:
> [https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L909]
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)