okumin commented on code in PR #6244:
URL: https://github.com/apache/hive/pull/6244#discussion_r2780780676
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2086,11 +2088,7 @@ private static List<Long>
extractNDVGroupingColumns(List<ColStatistics> colStats
// compute product of distinct values of grouping columns
for (ColStatistics cs : colStats) {
if (cs != null) {
- long ndv = cs.getCountDistint();
- if (cs.getNumNulls() > 0) {
- ndv = StatsUtils.safeAdd(ndv, 1);
- }
- ndvValues.add(ndv);
+ ndvValues.add(getGroupingColumnNdv(cs, parentStats));
Review Comment:
Please create a separate pull request next time you update something with
global impact. This PR affects approximately 200 test cases and would make it
harder for a reviewer to validate them if it included two or more types of
changes.
##########
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFCoalesce.java:
##########
@@ -83,18 +91,47 @@ public String getDisplayString(String[] children) {
@Override
public StatEstimator getStatEstimator() {
- return new CoalesceStatEstimator();
+ return new CoalesceStatEstimator(firstConstantIndex);
}
+ /**
+ * COALESCE returns the first non-null argument, so only values before (and
including)
+ * the first constant are reachable. Constants after the first one can never
be returned.
+ */
static class CoalesceStatEstimator implements StatEstimator {
+ private final int firstConstantIndex;
+
+ CoalesceStatEstimator(int firstConstantIndex) {
+ this.firstConstantIndex = firstConstantIndex;
+ }
@Override
public Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
PessimisticStatCombiner combiner = new PessimisticStatCombiner();
- for (int i = 0; i < argStats.size(); i++) {
+
+ if (firstConstantIndex == 0) {
+ // First arg is constant - always returns that constant, NDV = 1
+ combiner.add(argStats.get(0));
+ return combiner.getResult();
+ }
+
+ // Combine stats of columns before the first constant (or all if no
constant)
+ int limit = firstConstantIndex > 0 ? firstConstantIndex :
argStats.size();
+ for (int i = 0; i < limit; i++) {
combiner.add(argStats.get(i));
}
- return combiner.getResult();
+
+ Optional<ColStatistics> result = combiner.getResult();
+
+ // If there's a constant after columns, add 1 to NDV for that constant
+ if (result.isPresent() && firstConstantIndex > 0) {
Review Comment:
I guess it would be more consistent without this branch. The number of
distinct values of `coalesce(col_bool, false)` is up to 2, and `if(col_bool IS
NOT NULL, col_bool, false)` or an equivalent using CASE is likely to return 2,
but this probably returns 3 if I understand correctly.
##########
ql/src/test/queries/clientpositive/branching_expr_ndv.q:
##########
@@ -0,0 +1,75 @@
+CREATE TABLE t (cond INT, c2 STRING, c100 STRING, c0 STRING);
+ALTER TABLE t UPDATE STATISTICS SET('numRows'='10000','rawDataSize'='1000000');
+ALTER TABLE t UPDATE STATISTICS FOR COLUMN cond
SET('numDVs'='10','numNulls'='0');
+ALTER TABLE t UPDATE STATISTICS FOR COLUMN c2
SET('numDVs'='2','numNulls'='0','avgColLen'='5','maxColLen'='10');
+ALTER TABLE t UPDATE STATISTICS FOR COLUMN c100
SET('numDVs'='100','numNulls'='0','avgColLen'='5','maxColLen'='10');
+ALTER TABLE t UPDATE STATISTICS FOR COLUMN c0
SET('numDVs'='0','numNulls'='0','avgColLen'='5','maxColLen'='10');
+
+-- CASE WHEN: all constants distinct (NDV=3)
+EXPLAIN SELECT CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' ELSE 'C' END x
FROM t GROUP BY CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' ELSE 'C' END;
+
+-- CASE WHEN: all constants with duplicate (NDV=2)
+EXPLAIN SELECT CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'A' ELSE 'B' END x
FROM t GROUP BY CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'A' ELSE 'B' END;
+
+-- CASE WHEN: constants with NULL (NDV=1, NULL literal is not a
ConstantObjectInspector)
+EXPLAIN SELECT CASE WHEN cond=1 THEN NULL WHEN cond=2 THEN NULL ELSE 'A' END x
FROM t GROUP BY CASE WHEN cond=1 THEN NULL WHEN cond=2 THEN NULL ELSE 'A' END;
+
+-- CASE WHEN: all NULL constants (NDV=1)
+EXPLAIN SELECT CASE WHEN cond=1 THEN NULL ELSE NULL END x FROM t GROUP BY CASE
WHEN cond=1 THEN NULL ELSE NULL END;
+
+-- CASE WHEN: 3 constants + column, constants dominate (NDV=max(3,2)=3)
+EXPLAIN SELECT CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' WHEN cond=3 THEN
'C' ELSE c2 END x FROM t GROUP BY CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN
'B' WHEN cond=3 THEN 'C' ELSE c2 END;
+
+-- CASE WHEN: 2 constants + column, column dominates (NDV=max(2,100)=100)
+EXPLAIN SELECT CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' ELSE c100 END x
FROM t GROUP BY CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' ELSE c100 END;
+
+-- CASE WHEN: constant + unknown column (when NDV=0, Hive uses numRows/2
fallback)
+EXPLAIN SELECT CASE WHEN cond=1 THEN 'A' ELSE c0 END x FROM t GROUP BY CASE
WHEN cond=1 THEN 'A' ELSE c0 END;
+
+-- CASE WHEN: all columns, no constants (NDV=max(2,100)=100)
+EXPLAIN SELECT CASE WHEN cond=1 THEN c2 ELSE c100 END x FROM t GROUP BY CASE
WHEN cond=1 THEN c2 ELSE c100 END;
+
+-- CASE WHEN: no ELSE clause (NDV=1, implicit NULL ELSE is not a
ConstantObjectInspector)
+EXPLAIN SELECT CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' END x FROM t
GROUP BY CASE WHEN cond=1 THEN 'A' WHEN cond=2 THEN 'B' END;
Review Comment:
I wonder why this is not identical to [this test
case](https://github.com/konstantinb/hive/blob/b59cc9dd645b8d5859307ffcbe6a6155cfb8fc7d/ql/src/test/org/apache/hadoop/hive/ql/udf/generic/TestGenericUDFWhenStatEstimator.java#L93-L114)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]