neilconway opened a new pull request, #22718:
URL: https://github.com/apache/datafusion/pull/22718

   ## Which issue does this PR close?
   
   - Closes #22716 
   
   ## Rationale for this change
   
   #21081 capped the NDV at the row count when computing statistics for several 
operators. This PR extends that work and ensures that per-column statistics for 
filter operators are consistent with the estimated output row count. In 
particular:
   
   * Null count is also capped at the row count
   * Byte size is scaled down by the estimated selectivity
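
To illustrate the two adjustments above, here is a minimal sketch in plain Rust. It does not use DataFusion's actual statistics types: `ColumnEstimate`, `apply_filter_estimate`, and the choice to round up with `ceil` are all assumptions made for the example, not the PR's implementation.

```rust
/// Hypothetical, simplified per-column statistics (not DataFusion's real types).
#[derive(Debug, Clone, PartialEq)]
struct ColumnEstimate {
    null_count: u64,
    distinct_count: u64,
    byte_size: u64,
}

/// Given a filter's estimated selectivity, derive the estimated output row
/// count and per-column statistics that stay consistent with it.
fn apply_filter_estimate(
    input: &ColumnEstimate,
    input_rows: u64,
    selectivity: f64,
) -> (u64, ColumnEstimate) {
    // Guard against out-of-range selectivity estimates.
    let selectivity = selectivity.clamp(0.0, 1.0);
    // Rounding up is an assumption of this sketch.
    let output_rows = (input_rows as f64 * selectivity).ceil() as u64;

    let column = ColumnEstimate {
        // A column cannot hold more nulls than there are surviving rows.
        null_count: input.null_count.min(output_rows),
        // Likewise for the number of distinct values (NDV).
        distinct_count: input.distinct_count.min(output_rows),
        // Byte size is scaled by the same selectivity as the row count.
        byte_size: (input.byte_size as f64 * selectivity).ceil() as u64,
    };
    (output_rows, column)
}
```

For instance, with 1000 input rows, a selectivity of 0.25, and an input column reporting 400 nulls, 900 distinct values, and 8000 bytes, this sketch yields 250 output rows, with the null count and NDV both capped at 250 and the byte size scaled to 2000.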
   
   We also extend the analysis to consider null-rejecting predicates; for 
example, the clause `a = 10` as a top-level conjunct implies that the null 
count of `a` in the surviving rows is exactly 0.
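
The idea can be sketched with a toy predicate tree. The `Pred` enum and `null_rejected_columns` below are hypothetical names for illustration only and do not reflect DataFusion's expression types or the PR's code. The sketch walks only top-level `AND` conjuncts and conservatively treats anything under an `OR` as not null-rejecting.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified predicate tree (not DataFusion's expression type).
#[allow(dead_code)]
enum Pred {
    /// `column = literal`; evaluates to NULL (row filtered out) when the column is NULL.
    EqLiteral(String, i64),
    /// `column IS NULL`; keeps NULL rows, so it is not null-rejecting.
    IsNull(String),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

/// Columns whose null count among surviving rows must be 0,
/// judged only from top-level conjuncts.
fn null_rejected_columns(pred: &Pred) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    collect(pred, &mut out);
    out
}

fn collect(pred: &Pred, out: &mut BTreeSet<String>) {
    match pred {
        // Both sides of an AND must hold, so each side is a top-level conjunct.
        Pred::And(left, right) => {
            collect(left, out);
            collect(right, out);
        }
        // A comparison against a literal can never be true for a NULL input.
        Pred::EqLiteral(column, _) => {
            out.insert(column.clone());
        }
        // Conservative: do not look inside OR, and IS NULL keeps NULL rows.
        Pred::IsNull(_) | Pred::Or(..) => {}
    }
}
```

Under these assumptions, for `a = 10 AND (b = 1 OR c IS NULL)` only `a` is reported as null-rejected, so only its null count would be set to 0.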
   
   ## What changes are included in this PR?
   
   * Ensure per-column statistics (null count, byte size) are consistent with 
filtered row count
   * Detect null-rejecting predicates so that the null count of the affected 
columns can be estimated as exactly 0
   * Update SLT expected plans
   * Add unit tests for new behavior
   * Various refactoring and comment improvements
   
   ## Are these changes tested?
   
   Yes; new tests added.
   
   ## Are there any user-facing changes?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

