zabetak commented on code in PR #6359:
URL: https://github.com/apache/hive/pull/6359#discussion_r3033018561
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java:
##########
@@ -2109,7 +2111,10 @@ private static List<Long>
extractNDVGroupingColumns(List<ColStatistics> colStats
for (ColStatistics cs : colStats) {
if (cs != null) {
long ndv = cs.getCountDistint();
- if (cs.getNumNulls() > 0) {
+ // NDV needs to be adjusted if a column has a known NDV along with
NULL values
+ // or if a column happens to be "const NULL"
+ if ((ndv > 0 && cs.getNumNulls() > 0) ||
+ (ndv == 0 && !cs.isEstimated() && cs.getNumNulls() ==
parentStats.getNumRows())) {
Review Comment:
CASE WHEN statements can appear everywhere in the query plan and the changes
in the combiner make this stats better. I understand that there is some
correlation with grouping columns but not all queries have a GROUP BY clause
thus I feel we could separate the two problems.
I am not opposed in changing extractNDVGroupingColumns but was thinking that
its probably better to do it in a separate PR. If we can't, then option 2 seems
more appealing although I couldn't find out why this +1 logic was introduced in
the first place.
Overall, I trust your judgement since you spend much more time on the topic
so I am happy to let you decide how you want to move forward.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]