berkaysynnada commented on PR #14074: URL: https://github.com/apache/datafusion/pull/14074#issuecomment-2594728046
> Looks like I got hit by some new ColumnStatistics tests on main. Should be fixed now 🤞
>
> @berkaysynnada can you expand on the rationale for the V2 stats? I understand that it's more expressive, but I can't see in the PR or Notion how those distributions might actually be used? Is this for join planning?
>
> My understanding is I would no longer define a "min" or a "max" for a column. But there doesn't seem to be a place to define null count or sum?

You can still define min or max. We are not replacing Statistics with Statistics_v2; Statistics_v2 actually replaces the Precision and Interval objects. We plan to change the execution plan API from `fn statistics(&self) -> Statistics` to `fn statistics(&self) -> TableStatistics`, which is still structured as:

```
pub struct TableStatistics {
    pub num_rows: Statistics,
    pub total_byte_size: Statistics,
    pub column_statistics: Vec<ColumnStatistics>,
}
```

and

```
pub struct ColumnStatistics {
    pub null_count: Statistics,
    pub max_value: Statistics,
    pub min_value: Statistics,
    pub distinct_count: Statistics,
}
```

What we are trying to address is that indeterminate quantities are currently handled in a target-dependent way. For example, when a statistic is indeterminate, it is stored as a mean value if the caller requires an estimate, but as an interval if the caller requires bounds. Our goal is to consolidate all forms of indeterminism and give them a strong mathematical foundation, so that every user can use the statistics in the way they intend. We aim to preserve all potentially helpful quantities wherever feasible.

We are also constructing a robust evaluation and back-propagation mechanism (similar to interval arithmetic, evaluate_bounds, and propagate_constraints).
With this mechanism, any kind of expression, whether projection-based (evaluation only) or filter-based (evaluation followed by propagation), will automatically resolve using the new statistics.
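
To make the idea more concrete, here is a minimal sketch in Rust of one object serving both the "estimate" caller and the "bounds" caller, plus a toy evaluation and propagation step. This is only an illustration under simplifying assumptions, not the code in the PR: `Dist`, `mean`, `range`, `add_range`, and `restrict_le` are hypothetical names, and only two kinds of distribution are modeled.

```rust
// Hypothetical sketch: none of these names are DataFusion APIs.

/// A quantity that may be known exactly or only up to a range.
#[derive(Debug, Clone, PartialEq)]
enum Dist {
    /// The value is known exactly.
    Exact(f64),
    /// The value lies somewhere in [lo, hi], with no preferred point.
    Uniform { lo: f64, hi: f64 },
}

impl Dist {
    /// Point estimate, for callers that want a single number.
    fn mean(&self) -> f64 {
        match self {
            Dist::Exact(v) => *v,
            Dist::Uniform { lo, hi } => (*lo + *hi) / 2.0,
        }
    }

    /// Guaranteed bounds, for callers that need a range.
    fn range(&self) -> (f64, f64) {
        match self {
            Dist::Exact(v) => (*v, *v),
            Dist::Uniform { lo, hi } => (*lo, *hi),
        }
    }

    /// Evaluation step: bounds of `self + other`, as in interval arithmetic.
    fn add_range(&self, other: &Dist) -> (f64, f64) {
        let (a_lo, a_hi) = self.range();
        let (b_lo, b_hi) = other.range();
        (a_lo + b_lo, a_hi + b_hi)
    }

    /// Propagation step: narrow the quantity given the constraint `x <= limit`.
    /// Returns `None` when no value in the range can satisfy the constraint.
    fn restrict_le(&self, limit: f64) -> Option<Dist> {
        let (lo, hi) = self.range();
        if lo > limit {
            return None;
        }
        let new_hi = hi.min(limit);
        if lo == new_hi {
            Some(Dist::Exact(lo))
        } else {
            Some(Dist::Uniform { lo, hi: new_hi })
        }
    }
}
```

In this sketch, a row count known only to lie between 100 and 300 yields 200 from `mean()` and `(100, 300)` from `range()`, so neither caller forces a choice about how the uncertainty is stored. A filter such as `x <= 150` would narrow the same object to the range 100 to 150 via `restrict_le`.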