konstantinb commented on code in PR #6244:
URL: https://github.com/apache/hive/pull/6244#discussion_r2651876376


##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -41,9 +41,15 @@ public void add(ColStatistics stat) {
       if (stat.getAvgColLen() > result.getAvgColLen()) {
         result.setAvgColLen(stat.getAvgColLen());
       }
-      if (stat.getCountDistint() > result.getCountDistint()) {
-        result.setCountDistint(stat.getCountDistint());
-      }
+
+      // NDVs can only be accurately combined if full information about 
columns, query branches and
+      // their relationships is available. Without that info, there is only 
one "truly conservative"
+      // value of NDV which is 0, which means that the NDV is unknown. It 
forces optimizer
+      // to make the most conservative decisions possible, which is the exact 
goal of
+      // PessimisticStatCombiner. It does inflate statistics in multiple 
cases, but at the same time it
+      // also ensures than the query execution does not "blow up" due to too 
optimistic stats estimates
+      result.setCountDistint(0L);

Review Comment:
   This could appear counter-intuitive at first, however, when combining 
statistics of different logical branches of the same column, and having no 
reliable information about their interdependencies (i.e. in a "truly 
pessimistic" scenario), every other option appears to introduce undesired 
under-estimations, which often lead to catastrophic query failures.
   
   For example, a simple column generated by an CASE..WHEN clause with three 
constants produces an NDV of 1 by the original code, while, in most cases, the 
"true" NDV is 3. If such a column participates in a GROUP BY condition later 
on, its estimated number of records naturally becomes "1". Even this seemingly 
small under-estimation could lead to bad decision of converting to a mapjoin or 
not, especially over large data sets.
   
   Alternatively, trying to "total up" NDV values of the same columns could 
cause over-estimation of the true NDV of such a column, which, it its turn, 
could lead to a severe underestimation of records matching an "IN" filter, 
ultimately producing equally bad results as the previous case



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to