konstantinb commented on code in PR #6244:
URL: https://github.com/apache/hive/pull/6244#discussion_r2771730406
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -41,9 +41,15 @@ public void add(ColStatistics stat) {
if (stat.getAvgColLen() > result.getAvgColLen()) {
result.setAvgColLen(stat.getAvgColLen());
}
- if (stat.getCountDistint() > result.getCountDistint()) {
- result.setCountDistint(stat.getCountDistint());
- }
+
+ // NDVs can only be accurately combined if full information about columns, query branches and
+ // their relationships is available. Without that info, there is only one "truly conservative"
+ // value of NDV which is 0, which means that the NDV is unknown. It forces optimizer
+ // to make the most conservative decisions possible, which is the exact goal of
+ // PessimisticStatCombiner. It does inflate statistics in multiple cases, but at the same time it
+ // also ensures than the query execution does not "blow up" due to too optimistic stats estimates
+ result.setCountDistint(0L);
Review Comment:
Edit: per the PR feedback, this has been refined so that the NDV is set to
"unknown" (0) only when either of the values being combined is itself
"unknown". This results in much better estimates.
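
For illustration only, the refined rule could look roughly like the sketch
below. This is not the code from the PR: the class name NdvCombineSketch, the
method combineNdv and the constant UNKNOWN_NDV are hypothetical. The fallback
to the larger of two known NDVs is an assumption carried over from the lines
this hunk removes; the comment above only states what happens when a value is
unknown.

```java
// Hypothetical sketch of the refined NDV rule; not the code from the PR.
final class NdvCombineSketch {

    // Per the comment in the diff, an NDV of 0 means "unknown".
    static final long UNKNOWN_NDV = 0L;

    // Combines two NDV values pessimistically.
    static long combineNdv(long current, long incoming) {
        // If either side is unknown, the combined NDV is unknown as well.
        if (current == UNKNOWN_NDV || incoming == UNKNOWN_NDV) {
            return UNKNOWN_NDV;
        }
        // Assumption: with both values known, keep the larger one,
        // as the removed lines of the hunk did.
        return Math.max(current, incoming);
    }

    public static void main(String[] args) {
        System.out.println(combineNdv(10L, 25L)); // both known
        System.out.println(combineNdv(10L, 0L));  // one side unknown
    }
}
```

Under these assumptions, combining 10 and 25 gives 25, while combining 10 with
an unknown value gives 0, so the optimizer treats the NDV as unknown only when
it has to.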
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]