berkaysynnada commented on PR #14074:
URL: https://github.com/apache/datafusion/pull/14074#issuecomment-2594728046

   > Looks like I got hit by some new ColumnStatistics tests on main. Should be 
fixed now 🤞
   > 
   > @berkaysynnada can you expand on the rationale for the V2 stats? I 
understand that it's more expressive, but I can't see in the PR or Notion how 
those distributions might actually be used? Is this for join planning?
   > 
   > My understanding is I would no longer define a "min" or a "max" for a 
column. But there doesn't seem to be a place to define null count or sum?
   
   You can still define min and max. We are not replacing `Statistics` with Statistics_v2; what Statistics_v2 replaces is the `Precision` and `Interval` objects. We plan to rename the execution plan API from `fn statistics(&self) -> Statistics` to `fn statistics(&self) -> TableStatistics`, which is still structured as:
   ```rust
   pub struct TableStatistics {
       pub num_rows: Statistics,
       pub total_byte_size: Statistics,
       pub column_statistics: Vec<ColumnStatistics>,
   }
   ```
   and
   ```rust
   pub struct ColumnStatistics {
       pub null_count: Statistics,
       pub max_value: Statistics,
       pub min_value: Statistics,
       pub distinct_count: Statistics,
   }
   ```
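
   The per-field `Statistics` type above is the piece that replaces `Precision`/`Interval`: it carries a distribution rather than a single point or a bare pair of bounds. As a rough illustration of the idea (the variant set, field names, and helper methods below are assumptions for this sketch, not the PR's actual definitions), it could look like:
   ```rust
    // Illustrative sketch only: a statistic carried as a distribution, so the
    // same object can answer both "give me an estimate" and "give me bounds"
    // without committing to one encoding up front.
    #[derive(Clone, Debug, PartialEq)]
    pub enum Statistics {
        /// The value is known exactly (what Precision::Exact expresses today).
        Exact(f64),
        /// The value is only known to lie in [lower, upper]
        /// (what an Interval expresses today).
        Uniform { lower: f64, upper: f64 },
        /// The value is summarized by distribution parameters
        /// (subsumes Precision::Inexact, which keeps only a point estimate).
        Gaussian { mean: f64, variance: f64 },
    }

    impl Statistics {
        /// Point estimate, for callers that previously read an Inexact value.
        pub fn mean(&self) -> f64 {
            match self {
                Statistics::Exact(v) => *v,
                Statistics::Uniform { lower, upper } => (lower + upper) / 2.0,
                Statistics::Gaussian { mean, .. } => *mean,
            }
        }

        /// Conservative bounds, for callers that previously needed an Interval.
        pub fn range(&self) -> Option<(f64, f64)> {
            match self {
                Statistics::Exact(v) => Some((*v, *v)),
                Statistics::Uniform { lower, upper } => Some((*lower, *upper)),
                // An unbounded distribution has no finite hard bounds.
                Statistics::Gaussian { .. } => None,
            }
        }
    }
   ```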
   
   What we are trying to address is that indeterminate quantities are currently handled in a target-dependent way. For example, when a statistic is uncertain, it is stored as a mean value if the caller requires an estimate, but as an interval if the caller requires bounds.
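
   For reference, the two encodings being unified are roughly the following (paraphrased here to illustrate the point; see `Precision` in `datafusion-common` and the interval arithmetic module for the real definitions):
   ```rust
    // Paraphrased, simplified shapes of today's two representations.

    /// Estimate-oriented encoding: an uncertain statistic is collapsed to a
    /// single representative value, and any spread/bounds information is lost.
    pub enum Precision<T> {
        Exact(T),
        Inexact(T), // e.g. a mean
        Absent,
    }

    /// Bounds-oriented encoding: the same uncertainty is instead collapsed to
    /// a lower/upper pair, and the central tendency is lost.
    pub struct Interval<T> {
        pub lower: T,
        pub upper: T,
    }
   ```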
   
   Our goal is to consolidate all forms of indeterminism into one representation with a strong mathematical foundation, so that every user can consume the statistics in the way they need. We aim to preserve every useful quantity wherever feasible.
   
   We are also constructing a robust evaluation and back-propagation mechanism, similar to interval arithmetic's `evaluate_bounds` and `propagate_constraints`. With this mechanism, any kind of expression, whether projection-based (evaluation only) or filter-based (evaluation followed by propagation), will automatically resolve using the new statistics.
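
   In the same spirit as the existing `evaluate_bounds` / `propagate_constraints` pair, the statistics counterpart could look roughly like the following (hypothetical trait and method names, reusing the `Statistics` sketch from above; the actual API may differ):
   ```rust
    // Hypothetical sketch of the forward/backward passes over the new
    // statistics, mirroring the shape of the interval-arithmetic API.
    pub trait StatisticsSupport {
        /// Forward pass (projections): combine the children's statistics into
        /// this expression's output statistics.
        fn evaluate_statistics(&self, children: &[&Statistics]) -> Statistics;

        /// Backward pass (filters, after the forward pass): given the
        /// statistics imposed on this expression's output (e.g. "the predicate
        /// evaluates to true"), refine the children's statistics. `None` means
        /// the constraint is unsatisfiable.
        fn propagate_statistics(
            &self,
            parent: &Statistics,
            children: &[&Statistics],
        ) -> Option<Vec<Statistics>>;
    }
   ```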

