zabetak commented on code in PR #6359:
URL: https://github.com/apache/hive/pull/6359#discussion_r3016466506
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/StatEstimator.java:
##########
@@ -39,5 +40,19 @@ public interface StatEstimator {
* @param argStats the statistics for every argument of the UDF
* @return {@link ColStatistics} estimate for the actual UDF.
*/
- public Optional<ColStatistics> estimate(List<ColStatistics> argStats);
+ default Optional<ColStatistics> estimate(List<ColStatistics> argStats) {
+ throw new UnsupportedOperationException("This estimator requires
parentStats");
+ }
Review Comment:
Do we need to change this to default?
##########
ql/src/test/results/clientpositive/llap/vectorized_stats.q.out:
##########
@@ -1207,13 +1207,13 @@ STAGE PLANS:
minReductionHashAggr: 0.99
mode: hash
outputColumnNames: _col0
- Statistics: Num rows: 1 Data size: 40 Basic stats:
COMPLETE Column stats: COMPLETE
+ Statistics: Num rows: 6144 Data size: 183480 Basic
stats: COMPLETE Column stats: COMPLETE
Review Comment:
Thanks for the explanation. Since the new estimate is not always better
maybe we should rediscuss the change in extractNDVGroupingColumns or maybe
address HIVE-29534 first.
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -21,16 +21,26 @@
import java.util.Optional;
import org.apache.hadoop.hive.ql.plan.ColStatistics;
+import org.apache.hadoop.hive.ql.stats.StatsUtils;
/**
* Combines {@link ColStatistics} objects to provide the most pessimistic
estimate.
*/
public class PessimisticStatCombiner {
+ private final long numRows;
private boolean inited;
+ private boolean hasUnknownNDV;
private ColStatistics result;
+ public PessimisticStatCombiner(long numRows) {
+ this.numRows = numRows;
+ }
+
public void add(ColStatistics stat) {
+ // NDV==0 means unknown, unless it's a NULL constant (numNulls == numRows)
+ hasUnknownNDV = hasUnknownNDV || (stat.getCountDistint() == 0 &&
stat.getNumNulls() != numRows);
+
Review Comment:
Note that if in the CASE statement you have one branch with unknown NDV and
another that is the NULL constant the result of the combiner will appear to
other consumers as a NULL constant according to the criteria specified here.
Probably an edge case but still wanted to highlight that its hard to cover
everything from the moment that NDV = 0 is a valid value.
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2109,7 +2111,10 @@ private static List<Long>
extractNDVGroupingColumns(List<ColStatistics> colStats
for (ColStatistics cs : colStats) {
if (cs != null) {
long ndv = cs.getCountDistint();
- if (cs.getNumNulls() > 0) {
+ // NDV needs to be adjusted if a column has a known NDV along with
NULL values
+ // or if a column happens to be "const NULL"
+ if ((ndv > 0 && cs.getNumNulls() > 0) ||
+ (ndv == 0 && !cs.isEstimated() && cs.getNumNulls() ==
parentStats.getNumRows())) {
Review Comment:
Another reason why I suggest to defer this on another patch is cause I am
not fully convinced if we should actually do a +1 on the NDV for nulls. If we
continue on the assumption that is used elsewhere that null is not a value and
thus does not contribute to NDV then we shouldn't modify the NDV; possibly we
should set the `numNulls` to 1.
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2109,7 +2111,10 @@ private static List<Long>
extractNDVGroupingColumns(List<ColStatistics> colStats
for (ColStatistics cs : colStats) {
if (cs != null) {
long ndv = cs.getCountDistint();
- if (cs.getNumNulls() > 0) {
+ // NDV needs to be adjusted if a column has a known NDV along with
NULL values
+ // or if a column happens to be "const NULL"
+ if ((ndv > 0 && cs.getNumNulls() > 0) ||
+ (ndv == 0 && !cs.isEstimated() && cs.getNumNulls() ==
parentStats.getNumRows())) {
Review Comment:
It seems that this change led to more .q.out changes and addresses an NDV
issue on a slightly different place. Would it be possible to defer this to
another patch/Jira so that we converge and merge this PR sooner. I understand
the intention of the change but I am a bit reluctant to modify multiple things
as part of the same change.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]