[ https://issues.apache.org/jira/browse/HIVE-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Konstantin Bereznyakov updated HIVE-29625:
------------------------------------------
Description:
h2. Problem
{{ColStatistics.countDistinct}} (NDV) overloads the value {{0}}:
* *Verified zero* — the column genuinely has zero non-NULL distinct values
(all-NULL column, empty table).
* *Unknown* — upstream stats did not compute NDV; {{0}} leaks through as the
Thrift-primitive default from {{ColumnStatisticsObj.numDVs}}.
Downstream consumers cannot tell the two cases apart, so they apply identical
fallback heuristics ({{numRows / 2}}, {{factor *= 0.5}}, {{MAX_VALUE}}, etc.)
to both. For *verified zero* the heuristic is wrong (the true answer for
{{col = const}} is 0 matching rows), and for *unknown* it merely papers over
absent information.
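
To make the ambiguity concrete, here is a minimal sketch of a consumer that treats {{0}} as "no NDV available". The class {{OverloadedNdvSketch}} and its method are hypothetical illustrations, not Hive code; only the {{numRows / 2}} fallback is taken from the description above, and the uniform-distribution estimate {{numRows / ndv}} is an assumption made for the example.

```java
// Hypothetical illustration only; this is not Hive's estimator.
final class OverloadedNdvSketch {

    // Estimates rows matching "col = const" when 0 doubles as "unknown".
    static long estimateEqualityRows(long numRows, long ndv) {
        if (ndv > 0) {
            // Assumption for this sketch: values are uniformly distributed.
            return numRows / ndv;
        }
        // ndv == 0: the consumer cannot tell "unknown" from "verified zero",
        // so both cases take the same fallback heuristic.
        return numRows / 2;
    }

    public static void main(String[] args) {
        long allNullColumn = estimateEqualityRows(1000L, 0L); // true answer is 0 rows
        long statsMissing = estimateEqualityRows(1000L, 0L);  // Thrift default leaked through
        System.out.println(allNullColumn + " " + statsMissing);
    }
}
```

Both calls receive the same input, so no logic inside the consumer can separate them; the distinction has to be carried by the value itself.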
The other count-style fields on {{ColStatistics}} ({{numNulls}}, {{numTrues}},
{{numFalses}}) already follow the convention "negative = unknown, 0 = verified
zero, positive = verified count" — established by HIVE-29438.
{{countDistinct}} never got the same treatment.
h2. Convention after this change
For {{ColStatistics.countDistinct}}:
* *-1* (or any negative value) means *unknown* — NDV was not gathered or
cannot be determined.
* *0* means *verified zero* — the column has zero non-NULL distinct values.
* *positive value* means *verified count* — exactly that many distinct
non-NULL values.
This matches the existing convention for {{numNulls}}, {{numTrues}},
{{numFalses}}.
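
The three-state convention above could be consumed roughly as follows. This is a sketch under stated assumptions: {{NdvConventionSketch}}, {{NdvState}}, {{classify}} and {{estimateEqualityRows}} are hypothetical names that do not exist in Hive, the {{numRows / 2}} fallback for the unknown case is borrowed from the heuristics listed in the Problem section, and the uniform-distribution estimate is an assumption.

```java
// Hypothetical illustration of the convention; names are not part of Hive.
final class NdvConventionSketch {

    enum NdvState { UNKNOWN, VERIFIED_ZERO, VERIFIED_COUNT }

    // Maps a raw countDistinct value onto the three states of the convention.
    static NdvState classify(long countDistinct) {
        if (countDistinct < 0) {
            return NdvState.UNKNOWN;        // -1 or any negative value
        }
        if (countDistinct == 0) {
            return NdvState.VERIFIED_ZERO;  // no non-NULL distinct values
        }
        return NdvState.VERIFIED_COUNT;     // exactly that many distinct values
    }

    // Estimates rows matching "col = const" once the states are distinguishable.
    static long estimateEqualityRows(long numRows, long countDistinct) {
        switch (classify(countDistinct)) {
            case VERIFIED_ZERO:
                return 0L;                      // nothing can match a constant
            case VERIFIED_COUNT:
                return numRows / countDistinct; // assumed uniform distribution
            default:
                return numRows / 2;             // unknown: fallback heuristic
        }
    }

    public static void main(String[] args) {
        System.out.println(estimateEqualityRows(1000L, -1L));
        System.out.println(estimateEqualityRows(1000L, 0L));
        System.out.println(estimateEqualityRows(1000L, 10L));
    }
}
```

With the states separated, only the unknown case still relies on a heuristic; the verified-zero case returns the exact answer.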
was:
h2. Problem
\{{ColStatistics.countDistinct}} (NDV) overloads the value \{{0}}:
* *Verified zero* — the column genuinely has zero non-NULL distinct values
(all-NULL column, empty table).
* *Unknown* — upstream stats did not compute NDV; \{{0}} leaks through as the
Thrift-primitive default from \{{ColumnStatisticsObj.numDVs}}.
Downstream consumers cannot tell the two cases apart, so they apply identical
fallback heuristics (\{{numRows / 2}}, \{{factor *= 0.5}}, \{{MAX_VALUE}},
etc.) to both. For *verified zero* the heuristic is wrong (the true answer for
\{{col = const}} is 0 matching rows), and for *unknown* it merely papers over
absent information.
The other count-style fields on \{{ColStatistics}} (\{{numNulls}},
\{{numTrues}}, \{{numFalses}}) already follow the convention "negative =
unknown, 0 = verified zero, positive = verified count" — established by
HIVE-29438. \{{countDistinct}} never got the same treatment.
h2. Convention after this change
For \{{ColStatistics.countDistinct}}:
* *-1* (or any negative value) means *unknown* — NDV was not gathered or
cannot be determined.
* *0* means *verified zero* — the column has zero non-NULL distinct values.
* *positive value* means *verified count* — exactly that many distinct
non-NULL values.
This matches the existing convention for \{{numNulls}}, \{{numTrues}},
\{{numFalses}}.
> Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"
> -----------------------------------------------------------------------
>
> Key: HIVE-29625
> URL: https://issues.apache.org/jira/browse/HIVE-29625
> Project: Hive
> Issue Type: Improvement
> Reporter: Konstantin Bereznyakov
> Priority: Major
>
> h2. Problem
> {{ColStatistics.countDistinct}} (NDV) overloads the value {{0}}:
> * *Verified zero* — the column genuinely has zero non-NULL distinct values
> (all-NULL column, empty table).
> * *Unknown* — upstream stats did not compute NDV; {{0}} leaks through as the
> Thrift-primitive default from {{ColumnStatisticsObj.numDVs}}.
> Downstream consumers cannot tell the two cases apart, so they apply identical
> fallback heuristics ({{numRows / 2}}, {{factor *= 0.5}}, {{MAX_VALUE}}, etc.)
> to both. For *verified zero* the heuristic is wrong (the true answer for
> {{col = const}} is 0 matching rows), and for *unknown* it merely papers over
> absent information.
> The other count-style fields on {{ColStatistics}} ({{numNulls}}, {{numTrues}},
> {{numFalses}}) already follow the convention "negative = unknown, 0 = verified
> zero, positive = verified count" — established by HIVE-29438.
> {{countDistinct}} never got the same treatment.
> h2. Convention after this change
> For {{ColStatistics.countDistinct}}:
> * *-1* (or any negative value) means *unknown* — NDV was not gathered or
> cannot be determined.
> * *0* means *verified zero* — the column has zero non-NULL distinct values.
> * *positive value* means *verified count* — exactly that many distinct
> non-NULL values.
> This matches the existing convention for {{numNulls}}, {{numTrues}},
> {{numFalses}}.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)