konstantinb commented on PR #6359: URL: https://github.com/apache/hive/pull/6359#issuecomment-4158960930
> From my perspective missing NDVs is an edge case and not something that should appear too often during optimization. @zabetak, over the last few months of struggling with NDVs, I have identified various ways to trigger bad estimates with the current code. The following [estimation change in an .out file](https://github.com/apache/hive/pull/6359#discussion_r3012726545) immediately highlights a 35x underestimation of row count due to treating an NDV of 0 as a "true 0 value". Too small NDV values can lead to severe underestimation of GROUP BY cardinality; too large NDVs can severely underestimate the cardinality of an IN() filter. When dealing with very large tables (millions to billions of records), it can become prohibitively expensive to accurately track the number of unique values per column. This really makes missing (unknown) NDV a regular occurrence. For every attempt to estimate NDV using numRows, I quickly discovered a counterexample of a skewed dataset that led to severe misestimation, causing either a performance problem or an outright query failure. I would be happy to provide those examples if you like. In some of those cases, a better NDV estimation is possible. For example, with a CASE..WHEN statement with multiple constants. In many other cases, it is genuinely safer to use the most pessimistic algorithm, which handles "unknown NDV" columns surprisingly well. The very last commit has passing tests. I would appreciate it if you could take another look when you get a chance. The combiner logic is very close to ``` if (stat.getCountDistint() >= 0 && result.getCountDistint() >= 0) { result.setCountDistint(StatsUtils.safeAdd(result.getCountDistint(), stat.getCountDistint())); } ``` except specific handling NDV==0 for non-constant null columns cc @okumin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
