[
https://issues.apache.org/jira/browse/HIVE-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HIVE-29625:
----------------------------------
Labels: pull-request-available (was: )
> Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"
> -----------------------------------------------------------------------
>
> Key: HIVE-29625
> URL: https://issues.apache.org/jira/browse/HIVE-29625
> Project: Hive
> Issue Type: Improvement
> Reporter: Konstantin Bereznyakov
> Priority: Major
> Labels: pull-request-available
>
> h2. Problem
> {{ColStatistics.countDistinct}} (NDV) overloads the value {{0}}:
> * *Verified zero* — the column genuinely has zero non-NULL distinct values
> (all-NULL column, empty table).
> * *Unknown* — upstream stats did not compute NDV; {{0}} leaks through as the
> Thrift-primitive default from {{ColumnStatisticsObj.numDVs}}.
> Downstream consumers cannot tell the two cases apart, so they apply identical
> fallback heuristics ({{numRows / 2}}, {{factor *= 0.5}}, {{MAX_VALUE}}, etc.) to
> both. For *verified zero* the heuristic is wrong (the true answer for
> {{col = const}} is 0 matching rows), and for *unknown* it merely papers over
> absent information.
> The other count-style fields on {{ColStatistics}} ({{numNulls}}, {{numTrues}},
> {{numFalses}}) already follow the convention "negative = unknown, 0 = verified
> zero, positive = verified count", established by HIVE-29438. {{countDistinct}}
> never got the same treatment.
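To make the ambiguity concrete, here is a minimal Java sketch. It does not use Hive's real classes: {{LegacyNdvAmbiguity}}, {{RawColumnStats}}, and {{estimateNdv}} are hypothetical names, and the {{numRows / 2}} fallback is just one of the heuristics listed above, chosen for illustration.

```java
// Hypothetical stand-ins for illustration only; not Hive's actual classes.
final class LegacyNdvAmbiguity {

    /** A stats record whose primitive field defaults to 0 when never set. */
    static final class RawColumnStats {
        long numDVs; // stays 0 if NDV was never computed upstream
    }

    /**
     * Pre-change style consumer: any non-positive NDV is treated as
     * "missing" and replaced by the numRows / 2 heuristic.
     */
    static long estimateNdv(RawColumnStats stats, long numRows) {
        return stats.numDVs > 0 ? stats.numDVs : Math.max(1, numRows / 2);
    }

    public static void main(String[] args) {
        long numRows = 1_000;

        RawColumnStats neverComputed = new RawColumnStats(); // NDV unknown
        RawColumnStats allNullColumn = new RawColumnStats();
        allNullColumn.numDVs = 0; // NDV verified as zero

        // Both cases carry the same value, so both get the same fallback.
        System.out.println(estimateNdv(neverComputed, numRows)); // prints 500
        System.out.println(estimateNdv(allNullColumn, numRows)); // prints 500
    }
}
```

The all-NULL column ends up with an estimated NDV of 500 even though the verified answer is 0, because the consumer has no way to distinguish the two inputs.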
> h2. Convention after this change
> For {{ColStatistics.countDistinct}}:
> * *-1* (or any negative value) means *unknown* — NDV was not gathered or
> cannot be determined.
> * *0* means *verified zero* — the column has zero non-NULL distinct values.
> * *positive value* means *verified count* — exactly that many distinct
> non-NULL values.
> This matches the existing convention for {{numNulls}}, {{numTrues}},
> {{numFalses}}.
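The convention above could be consumed roughly as in the following Java sketch. It is not the patch itself: {{NdvConvention}}, {{Kind}}, {{classify}}, and {{estimateEqualityRows}} are hypothetical names, the {{numRows / countDistinct}} estimate assumes uniformly distributed values, and the {{numRows / 2}} fallback for the unknown case is only an example heuristic.

```java
// Hypothetical helper illustrating the three-way convention; not Hive code.
final class NdvConvention {

    enum Kind { UNKNOWN, VERIFIED_ZERO, VERIFIED_COUNT }

    /** Maps a raw countDistinct value onto the convention. */
    static Kind classify(long countDistinct) {
        if (countDistinct < 0) {
            return Kind.UNKNOWN;        // -1 or any other negative value
        }
        if (countDistinct == 0) {
            return Kind.VERIFIED_ZERO;  // no non-NULL distinct values
        }
        return Kind.VERIFIED_COUNT;     // exactly that many distinct values
    }

    /** Estimated number of rows matching {@code col = const}. */
    static long estimateEqualityRows(long countDistinct, long numRows) {
        switch (classify(countDistinct)) {
            case VERIFIED_ZERO:
                return 0; // nothing can match: the column has no non-NULL value
            case VERIFIED_COUNT:
                return numRows / countDistinct; // assumes uniform distribution
            default:
                return numRows / 2; // unknown: illustrative fallback heuristic
        }
    }

    public static void main(String[] args) {
        System.out.println(estimateEqualityRows(-1, 1_000)); // prints 500
        System.out.println(estimateEqualityRows(0, 1_000));  // prints 0
        System.out.println(estimateEqualityRows(4, 1_000));  // prints 250
    }
}
```

With the sentinel in place, the fallback heuristic applies only when NDV is actually missing, and the verified-zero case returns the exact answer.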