konstantinb commented on code in PR #6359:
URL: https://github.com/apache/hive/pull/6359#discussion_r3028888736
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -21,16 +21,26 @@
import java.util.Optional;
import org.apache.hadoop.hive.ql.plan.ColStatistics;
+import org.apache.hadoop.hive.ql.stats.StatsUtils;
/**
* Combines {@link ColStatistics} objects to provide the most pessimistic estimate.
*/
public class PessimisticStatCombiner {
+ private final long numRows;
private boolean inited;
+ private boolean hasUnknownNDV;
private ColStatistics result;
+ public PessimisticStatCombiner(long numRows) {
+ this.numRows = numRows;
+ }
+
public void add(ColStatistics stat) {
+ // NDV==0 means unknown, unless it's a NULL constant (numNulls == numRows)
+ hasUnknownNDV = hasUnknownNDV || (stat.getCountDistint() == 0 && stat.getNumNulls() != numRows);
+
Review Comment:
I struggled with this too. Technically, in the code before these
changes, this makes the combined stats for the column come out with numNulls ==
numRows. The "estimated" flag could be used to decide how much trust a consumer
should place in such statistics entries.
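
To make the condition in the hunk above easier to reason about, here is a minimal, self-contained sketch of the check in isolation. `UnknownNdvSketch` and its nested `Stat` holder are hypothetical stand-ins, not the real `PessimisticStatCombiner` or `ColStatistics`; only the boolean expression is taken from the diff.

```java
// Hypothetical sketch: mirrors only the "unknown NDV" tracking from the diff.
final class UnknownNdvSketch {

    // Hypothetical stand-in for ColStatistics, reduced to the two values the check reads.
    static final class Stat {
        final long countDistinct;
        final long numNulls;

        Stat(long countDistinct, long numNulls) {
            this.countDistinct = countDistinct;
            this.numNulls = numNulls;
        }
    }

    private final long numRows;
    private boolean hasUnknownNdv;

    UnknownNdvSketch(long numRows) {
        this.numRows = numRows;
    }

    // NDV == 0 is treated as unknown unless every row is NULL (numNulls == numRows),
    // which is how a NULL constant shows up in the stats.
    void add(Stat stat) {
        hasUnknownNdv = hasUnknownNdv
            || (stat.countDistinct == 0 && stat.numNulls != numRows);
    }

    boolean hasUnknownNdv() {
        return hasUnknownNdv;
    }
}
```

With `numRows = 10`, a branch reporting NDV 0 and 10 nulls leaves the flag unset, while a branch reporting NDV 0 and 3 nulls sets it. Once set, the flag stays set for every later `add` call.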
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]