konstantinb commented on PR #6359:
URL: https://github.com/apache/hive/pull/6359#issuecomment-4158960930

   > From my perspective missing NDVs is an edge case and not something that 
should appear too often during optimization.
   
   @zabetak, over the last few months of struggling with NDVs, I have 
identified various ways to trigger bad estimates with the current code. The 
following [estimation change in an .out 
file](https://github.com/apache/hive/pull/6359#discussion_r3012726545) 
immediately highlights a 35x underestimation of row count due to treating an 
NDV of 0 as a "true 0 value".
   
   Too small NDV values can lead to severe underestimation of GROUP BY 
cardinality; too large NDVs can severely underestimate the cardinality of an 
IN() filter.
   
   When dealing with very large tables (millions to billions of records), it 
can become prohibitively expensive to accurately track the number of unique 
values per column. This really makes missing (unknown) NDV a regular occurrence.
   
   For every attempt to estimate NDV using numRows, I quickly discovered a 
counterexample of a skewed dataset that led to severe misestimation, causing 
either a performance problem or an outright query failure. I would be happy to 
provide those examples if you like.
   
   In some of those cases, a better NDV estimation is possible. For example, 
with a CASE..WHEN statement with multiple constants. In many other cases, it is 
genuinely safer to use the most pessimistic algorithm, which handles "unknown 
NDV" columns surprisingly well.
   
   The very last commit has passing tests. I would appreciate it if you could 
take another look when you get a chance.
   
   The combiner logic is very close to
   ```
       if (stat.getCountDistint() >= 0 && result.getCountDistint() >= 0) {
         result.setCountDistint(StatsUtils.safeAdd(result.getCountDistint(), 
stat.getCountDistint()));
       }
   
   ```
   except specific handling NDV==0 for non-constant null columns
   
   cc @okumin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to