[
https://issues.apache.org/jira/browse/HIVE-29556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Bereznyakov reassigned HIVE-29556:
---------------------------------------------
Assignee: Konstantin Bereznyakov
> StatsUtils::extractNDVGroupingColumns() adjust "unknown" NDVs of "0" to "1"
> ---------------------------------------------------------------------------
>
> Key: HIVE-29556
> URL: https://issues.apache.org/jira/browse/HIVE-29556
> Project: Hive
> Issue Type: Bug
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
> Labels: pull-request-available
> Attachments: reproduce_groupby_unknown_ndv.q,
> reproduce_groupby_unknown_ndv.q.out
>
>
> h2. Problem
> In {{StatsUtils.extractNDVGroupingColumns}}, every grouping column whose
> {{ColStatistics}} arrives with the signature {{(NDV == 0 && numNulls > 0)}}
> is unconditionally promoted to {{NDV = 1}} via:
> {code:java}
> if (cs.getNumNulls() > 0) {
> ndv = StatsUtils.safeAdd(ndv, 1);
> }
> {code}
> The {{+1}} adjustment exists to account for the NULL bucket — GROUP BY
> treats NULL as a
> distinct group key in SQL semantics. The intent is correct when {{NDV}}
> carries a real
> non-zero count: {{K + 1}} properly represents "K real groups + 1 NULL
> bucket".
> The bug is that {{NDV = 0}} is *overloaded* in Hive's stats model. It can
> mean either:
> # *Unknown NDV* — stats are partial or NDV was not computed by the
> producer; {{0}} is the
> sentinel for "we don't know".
> # *Verified zero* — the column has zero non-null distinct values (e.g.,
> all-NULL).
> The {{+1}} adjustment yields the correct answer for case (2) — {{1}} is
> exactly the count
> of distinct groups for an all-NULL column (the NULL bucket). For case (1)
> it produces a
> *confidently-wrong NDV count*: it converts "we don't know" into "we know
> there's exactly
> 1 group", masking the unknown nature of the stats.
> h2. How the under-estimate propagates
> {{computeNDVGroupingColumns}} reduces per-column NDVs into {{ndvProduct}}.
> With the bogus
> {{NDV = 1}}, {{ndvProduct >= 1}}. Downstream consumers gate on {{ndvProduct
> == 0}}:
> * {{StatsRulesProcFactory}} (GROUP BY stats annotation) falls back to a
> {{parentNumRows / 2}} heuristic when {{ndvProduct == 0}}; otherwise it uses
> {{ndvProduct}}
> directly. With the bug, the fallback never fires.
> * {{SetHashGroupByMinReduction}} returns {{null}} (keeps the operator's
> default
> {{minReductionHashAggr}}) when {{ndvProduct == 0}}; otherwise it computes
> {{1 - ndvProduct/numRows}}.
> The Case-3 cardinality formula in {{StatsRulesProcFactory}} then becomes:
> {code}
> cardinality = min(parentNumRows/2, ndvProduct * parallelism)
> = min(50M, 1)
> = 1
> {code}
> The GROUP BY operator's row-count annotation in EXPLAIN reads {{Statistics:
> Num rows: 1}}.
> h2. Cascading consequences
> The bogus 1-row estimate propagates downstream and causes CBO to make wrong
> operator-selection decisions:
> * *Map Join selection* — a 1-row side is always broadcast-eligible. The
> optimizer chooses
> {{Map Join Operator}} where the actual data would overflow the broadcast
> threshold (and
> should be a {{Merge Join Operator}}). This is the most visible failure mode
> at scale and
> risks runtime mapjoin OOM.
> * *Reducer parallelism* — too few reducers allocated for what is actually a
> large
> GROUP BY output.
> * *Memory sizing* — undersized hash-aggregation buffers.
> * *Join order* — CBO favors plan shapes built on the "small" intermediate,
> even if the
> actual data isn't small.
> h2. Sources of the {{(NDV=0, numNulls>0)}} signature on master
> The signature can reach {{extractNDVGroupingColumns}} through multiple
> paths in the
> current code:
> # *Metastore-loaded stats* with NDV not gathered but numNulls populated
> (partial ANALYZE;
> explicit {{ALTER TABLE UPDATE STATISTICS}}; upstream stats producer that
> doesn't compute
> NDV).
> # *PessimisticStatCombiner output* for conditional expressions ({{CASE
> WHEN}},
> {{COALESCE}}, {{IF}}) where one branch contributes unknown NDV (NDV=0) and
> another is a
> NULL literal: the combiner's {{max(NDV)}} over {{[0, 0]}} stays at {{0}},
> while
> {{max(numNulls)}} picks up the NULL literal's {{numNulls = parentNumRows}},
> producing the
> signature on the combined output column.
> (The {{NDV=0}} signature can *also* arise from genuinely verified-zero
> sources —
> {{buildColStatForConstant}} for NULL literals and
> {{buildColStatForBoolean}} for
> all-NULL booleans — but those cases are not part of this issue: the {{+1}}
> adjustment
> yields the semantically-correct answer there.)
> h2. Reproduction
> Two probes attached: {{ [^reproduce_groupby_unknown_ndv.q] }} (the queries)
> and
> {{ [^reproduce_groupby_unknown_ndv.q.out] }} (master's output, captured
> against the current master
> branch with {{TestMiniLlapLocalCliDriver}}).
> * *U1* — direct GROUP BY on a column with stats explicitly set to
> {{(numDVs=0,
> numNulls=parentRows)}} (source #1).
> * *U2* — GROUP BY a {{CASE WHEN}} expression with one NULL-literal branch
> combined with
> an unknown-NDV column (source #2).
> For both probes, master estimates {{Statistics: Num rows: 1}} on the inner
> GROUP BY
> operators. With {{hive.auto.convert.join=true}}, the bogus 1-row side is
> broadcast-eligible
> and the optimizer selects {{Map Join Operator}} for the downstream join,
> even though the
> join is sized for {{Merge Join Operator}}. The {{.q.out}} shows two {{Map
> Join Operator}}
> instances — exactly the operator-level under-estimation HIVE-29556 targets.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)