[ 
https://issues.apache.org/jira/browse/HIVE-29368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-29368:
----------------------------------
    Labels: pull-request-available  (was: )

> NDV is underestimated for temp tables having CASE/WHEN/IF columns
> -----------------------------------------------------------------
>
>                 Key: HIVE-29368
>                 URL: https://issues.apache.org/jira/browse/HIVE-29368
>             Project: Hive
>          Issue Type: Bug
>         Environment: the test file [^mapjoin_case_when_ndv_bug.q] contaisn an 
> artificially low max limit for mapjoin conversion. Yet 
> [^mapjoin_case_when_ndv_bug.q.out] shows that the conversion still happened, 
> the reason being that the "SMALL" table is estimated with 1 record while its 
> accurate estimation is 20 records. The query plan shows a MAPJOIN, which, on 
> a larger scale, could lead to query failures due to OOMs
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: mapjoin_case_when_ndv_bug.q, 
> mapjoin_case_when_ndv_bug.q.out
>
>
> If an intermediate table is created in a subquery using a CASE..WHEN clause 
> with constants, like:
> {{CASE WHEN category = 1 THEN 'A'}}
> {{       WHEN category = 2 THEN 'B'}}
> {{       ELSE 'C'}}
> {{END}}
> each value is correctly considered to have an NDV of 1. 
> Later down the road, due to using PessimisticStatCombiner for combining these 
> stats to a single stats object for the new column, the value remains 1 for 
> the whole column:
> [https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java#L152]
> [https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java#L45]
> When a GROUP BY clause uses this column, the Optimizer correctly estimates 
> the maximum possible number of rows in the result as 1, which could lead to 
> overly optimistic subsequent steps, including a mapjoin conversion.
> This could also lead to false "0" estimations of NOT expressions here: 
> [https://github.com/apache/hive/blob/2cd59de300e9dae3fe1d6d2538efdbf56f80c763/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L909]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to