[ 
https://issues.apache.org/jira/browse/HIVE-29556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29556:
------------------------------------------
    Summary: StatsUtils::extractNDVGroupingColumns() adjusts "unknown" NDVs of 
"0" to "1"  (was: StatsUtils::extractNDVGroupingColumns() adjust "unknown" NDVs 
of "0" to "1")

> StatsUtils::extractNDVGroupingColumns() adjusts "unknown" NDVs of "0" to "1"
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-29556
>                 URL: https://issues.apache.org/jira/browse/HIVE-29556
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Konstantin Bereznyakov
>            Assignee: Konstantin Bereznyakov
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: reproduce_groupby_unknown_ndv.q, 
> reproduce_groupby_unknown_ndv.q.out
>
>
> h2. Problem
>   In {{StatsUtils.extractNDVGroupingColumns}}, every grouping column whose
>   {{ColStatistics}} arrives with the signature {{(NDV == 0 && numNulls > 0)}} 
> is unconditionally promoted to {{NDV = 1}} via:
>   {code:java}
>   if (cs.getNumNulls() > 0) {
>     ndv = StatsUtils.safeAdd(ndv, 1);
>   }
>   {code}
>   The {{+1}} adjustment exists to account for the NULL bucket — GROUP BY 
> treats NULL as a
>   distinct group key in SQL semantics. The intent is correct when {{NDV}} 
> carries a real
>   non-zero count: {{K + 1}} properly represents "K real groups + 1 NULL 
> bucket".
>   The bug is that {{NDV = 0}} is *overloaded* in Hive's stats model. It can 
> mean either:
>   # *Unknown NDV* — stats are partial or NDV was not computed by the 
> producer; {{0}} is the
>   sentinel for "we don't know".
>   # *Verified zero* — the column has zero non-null distinct values (e.g., 
> all-NULL).
>   The {{+1}} adjustment yields the correct answer for case (2) — {{1}} is 
> exactly the count
>   of distinct groups for an all-NULL column (the NULL bucket). For case (1) 
> it produces a
>   *confidently-wrong NDV count*: it converts "we don't know" into "we know 
> there's exactly
>   1 group", masking the unknown nature of the stats.
>   h2. How the under-estimate propagates
>   {{computeNDVGroupingColumns}} reduces per-column NDVs into {{ndvProduct}}. 
> With the bogus
>   {{NDV = 1}}, {{ndvProduct >= 1}}. Downstream consumers gate on {{ndvProduct 
> == 0}}:
>   * {{StatsRulesProcFactory}} (GROUP BY stats annotation) falls back to a
>   {{parentNumRows / 2}} heuristic when {{ndvProduct == 0}}; otherwise it uses 
> {{ndvProduct}}
>   directly. With the bug, the fallback never fires.
>   * {{SetHashGroupByMinReduction}} returns {{null}} (keeps the operator's 
> default
>   {{minReductionHashAggr}}) when {{ndvProduct == 0}}; otherwise it computes
>   {{1 - ndvProduct/numRows}}.
>   The Case-3 cardinality formula in {{StatsRulesProcFactory}} then becomes:
>   {code}
>   cardinality = min(parentNumRows/2, ndvProduct * parallelism)
>               = min(50M, 1)
>               = 1
>   {code}
>   The GROUP BY operator's row-count annotation in EXPLAIN reads {{Statistics: 
> Num rows: 1}}.
>   h2. Cascading consequences
>   The bogus 1-row estimate propagates downstream and causes CBO to make wrong
>   operator-selection decisions:
>   * *Map Join selection* — a 1-row side is always broadcast-eligible. The 
> optimizer chooses
>   {{Map Join Operator}} where the actual data would overflow the broadcast 
> threshold (and
>   should be a {{Merge Join Operator}}). This is the most visible failure mode 
> at scale and
>   risks runtime mapjoin OOM.
>   * *Reducer parallelism* — too few reducers allocated for what is actually a 
> large
>   GROUP BY output.
>   * *Memory sizing* — undersized hash-aggregation buffers.
>   * *Join order* — CBO favors plan shapes built on the "small" intermediate, 
> even if the
>   actual data isn't small.
>   h2. Sources of the {{(NDV=0, numNulls>0)}} signature on master
>   The signature can reach {{extractNDVGroupingColumns}} through multiple 
> paths in the
>   current code:
>   # *Metastore-loaded stats* with NDV not gathered but numNulls populated 
> (partial ANALYZE;
>   explicit {{ALTER TABLE UPDATE STATISTICS}}; upstream stats producer that 
> doesn't compute
>   NDV).
>   # *PessimisticStatCombiner output* for conditional expressions ({{CASE 
> WHEN}},
>   {{COALESCE}}, {{IF}}) where one branch contributes unknown NDV (NDV=0) and 
> another is a
>   NULL literal: the combiner's {{max(NDV)}} over {{[0, 0]}} stays at {{0}}, 
> while
>   {{max(numNulls)}} picks up the NULL literal's {{numNulls = parentNumRows}}, 
> producing the
>   signature on the combined output column.
>   (The {{NDV=0}} signature can *also* arise from genuinely verified-zero 
> sources —
>   {{buildColStatForConstant}} for NULL literals and 
> {{buildColStatForBoolean}} for
>   all-NULL booleans — but those cases are not part of this issue: the {{+1}} 
> adjustment
>   yields the semantically-correct answer there.)
>   h2. Reproduction
>   Two probes attached: {{ [^reproduce_groupby_unknown_ndv.q] }} (the queries) 
> and
>   {{ [^reproduce_groupby_unknown_ndv.q.out] }} (master's output, captured 
> against the current master
>   branch with {{TestMiniLlapLocalCliDriver}}).
>   * *U1* — direct GROUP BY on a column with stats explicitly set to 
> {{(numDVs=0,
>   numNulls=parentRows)}} (source #1).
>   * *U2* — GROUP BY a {{CASE WHEN}} expression with one NULL-literal branch 
> combined with
>   an unknown-NDV column (source #2).
>   For both probes, master estimates {{Statistics: Num rows: 1}} on the inner 
> GROUP BY
>   operators. With {{hive.auto.convert.join=true}}, the bogus 1-row side is 
> broadcast-eligible
>   and the optimizer selects {{Map Join Operator}} for the downstream join, 
> even though the
>   join is sized for {{Merge Join Operator}}. The {{.q.out}} shows two {{Map 
> Join Operator}}
>   instances — exactly the operator-level under-estimation HIVE-29556 targets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to