[
https://issues.apache.org/jira/browse/HIVE-29556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084476#comment-18084476
]
Konstantin Bereznyakov commented on HIVE-29556:
-----------------------------------------------
[~zabetak] Thank you for following up on this; I am sorry for the lack of
further updates for some time. Your instinct against adding isConst() proved
fully justified, as trying to implement it accurately quickly introduced new
dependent inaccuracies. I am referring to my attempts to address
[https://github.com/apache/hive/pull/6418#discussion_r3233225674], which broke
a few tests and required subsequent changes to support isConst in multiple
places.
I had since abandoned the idea of adding isConst in favor of "honestly"
implementing an unknown NDV as "-1", similarly to numNulls/numTrues/numFalses.
[https://github.com/apache/hive/pull/6505] contains the proposed changes than
naturally make adding isConst() unnecessary while logically solving the "+1"
problem in extractNDVGroupingColumns()
{quote}"Sources of the (NDV=0, numNulls>0)"
{quote}
There is one scenario like this immediately exposed by the integration test
parquet_types_non_dictionary_encoding_vectorization.q
*BinaryColumnStatsData* class does not have an NDV field while having numNulls,
so the ColStatistics currently generated for this data type naturally defaults
to the NDV of 0, immediately collapsing the estimation of the GROUP BY product
to "1".
Another typical scenario occurs with using privately controlled implementations
of metastores (IMetaStoreClient implementations) other than Hive's own. It is
very expensive to accurately track the number of unique values in very large
tables, and "unknown" may be the only practical choice for those.
This is a significant problem in a private implementation I am working on, and
we already had to implement some private fork changes to work around this. The
objective of this and related stories is to sync up, if possible.
The bug https://issues.apache.org/jira/browse/HIVE-29368 about
PessimisticStatCombiner was the original trigger, but the "unknown NDV" problem
has since become a standalone issue. Also, the desired fix for
PessimisticStatCombiner becomes trivial if we resolve the NDV ambiguity.
> StatsUtils::extractNDVGroupingColumns() adjusts "unknown" NDVs of "0" to "1"
> ----------------------------------------------------------------------------
>
> Key: HIVE-29556
> URL: https://issues.apache.org/jira/browse/HIVE-29556
> Project: Hive
> Issue Type: Bug
> Reporter: Konstantin Bereznyakov
> Assignee: Konstantin Bereznyakov
> Priority: Major
> Labels: pull-request-available
> Attachments: reproduce_groupby_unknown_ndv.q,
> reproduce_groupby_unknown_ndv.q.out
>
>
> h2. Problem
> In {{StatsUtils.extractNDVGroupingColumns}}, every grouping column whose
> {{ColStatistics}} arrives with the signature {{(NDV == 0 && numNulls > 0)}}
> is unconditionally promoted to {{NDV = 1}} via:
> {code:java}
> if (cs.getNumNulls() > 0) {
> ndv = StatsUtils.safeAdd(ndv, 1);
> }
> {code}
> The {{+1}} adjustment exists to account for the NULL bucket — GROUP BY
> treats NULL as a
> distinct group key in SQL semantics. The intent is correct when {{NDV}}
> carries a real
> non-zero count: {{K + 1}} properly represents "K real groups + 1 NULL
> bucket".
> The bug is that {{NDV = 0}} is *overloaded* in Hive's stats model. It can
> mean either:
> # *Unknown NDV* — stats are partial or NDV was not computed by the
> producer; {{0}} is the
> sentinel for "we don't know".
> # *Verified zero* — the column has zero non-null distinct values (e.g.,
> all-NULL).
> The {{+1}} adjustment yields the correct answer for case (2) — {{1}} is
> exactly the count
> of distinct groups for an all-NULL column (the NULL bucket). For case (1)
> it produces a
> *confidently-wrong NDV count*: it converts "we don't know" into "we know
> there's exactly
> 1 group", masking the unknown nature of the stats.
> h2. How the under-estimate propagates
> {{computeNDVGroupingColumns}} reduces per-column NDVs into {{ndvProduct}}.
> With the bogus
> {{NDV = 1}}, {{ndvProduct >= 1}}. Downstream consumers gate on {{ndvProduct
> == 0}}:
> * {{StatsRulesProcFactory}} (GROUP BY stats annotation) falls back to a
> {{parentNumRows / 2}} heuristic when {{ndvProduct == 0}}; otherwise it uses
> {{ndvProduct}}
> directly. With the bug, the fallback never fires.
> * {{SetHashGroupByMinReduction}} returns {{null}} (keeps the operator's
> default
> {{minReductionHashAggr}}) when {{ndvProduct == 0}}; otherwise it computes
> {{1 - ndvProduct/numRows}}.
> The Case-3 cardinality formula in {{StatsRulesProcFactory}} then becomes:
> {code}
> cardinality = min(parentNumRows/2, ndvProduct * parallelism)
> = min(50M, 1)
> = 1
> {code}
> The GROUP BY operator's row-count annotation in EXPLAIN reads {{Statistics:
> Num rows: 1}}.
> h2. Cascading consequences
> The bogus 1-row estimate propagates downstream and causes CBO to make wrong
> operator-selection decisions:
> * *Map Join selection* — a 1-row side is always broadcast-eligible. The
> optimizer chooses
> {{Map Join Operator}} where the actual data would overflow the broadcast
> threshold (and
> should be a {{Merge Join Operator}}). This is the most visible failure mode
> at scale and
> risks runtime mapjoin OOM.
> * *Reducer parallelism* — too few reducers allocated for what is actually a
> large
> GROUP BY output.
> * *Memory sizing* — undersized hash-aggregation buffers.
> * *Join order* — CBO favors plan shapes built on the "small" intermediate,
> even if the
> actual data isn't small.
> h2. Sources of the {{(NDV=0, numNulls>0)}} signature on master
> The signature can reach {{extractNDVGroupingColumns}} through multiple
> paths in the
> current code:
> # *Metastore-loaded stats* with NDV not gathered but numNulls populated
> (partial ANALYZE;
> explicit {{ALTER TABLE UPDATE STATISTICS}}; upstream stats producer that
> doesn't compute
> NDV).
> # *PessimisticStatCombiner output* for conditional expressions ({{CASE
> WHEN}},
> {{COALESCE}}, {{IF}}) where one branch contributes unknown NDV (NDV=0) and
> another is a
> NULL literal: the combiner's {{max(NDV)}} over {{[0, 0]}} stays at {{0}},
> while
> {{max(numNulls)}} picks up the NULL literal's {{numNulls = parentNumRows}},
> producing the
> signature on the combined output column.
> (The {{NDV=0}} signature can *also* arise from genuinely verified-zero
> sources —
> {{buildColStatForConstant}} for NULL literals and
> {{buildColStatForBoolean}} for
> all-NULL booleans — but those cases are not part of this issue: the {{+1}}
> adjustment
> yields the semantically-correct answer there.)
> h2. Reproduction
> Two probes attached: {{ [^reproduce_groupby_unknown_ndv.q] }} (the queries)
> and
> {{ [^reproduce_groupby_unknown_ndv.q.out] }} (master's output, captured
> against the current master
> branch with {{TestMiniLlapLocalCliDriver}}).
> * *U1* — direct GROUP BY on a column with stats explicitly set to
> {{(numDVs=0,
> numNulls=parentRows)}} (source #1).
> * *U2* — GROUP BY a {{CASE WHEN}} expression with one NULL-literal branch
> combined with
> an unknown-NDV column (source #2).
> For both probes, master estimates {{Statistics: Num rows: 1}} on the inner
> GROUP BY
> operators. With {{hive.auto.convert.join=true}}, the bogus 1-row side is
> broadcast-eligible
> and the optimizer selects {{Map Join Operator}} for the downstream join,
> even though the
> join is sized for {{Merge Join Operator}}. The {{.q.out}} shows two {{Map
> Join Operator}}
> instances — exactly the operator-level under-estimation HIVE-29556 targets.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)