gatesn commented on PR #13736: URL: https://github.com/apache/datafusion/pull/13736#issuecomment-2544064404
> Sum statistics are not available in parquet, but I can easily see other file formats providing them so I think this is a good addition to DataFusion. Aha, I might know one of those file formats! Here's the click bench diff (give or take quite a bit of noise): https://github.com/spiraldb/vortex/actions/runs/12297901942#summary-34320120073 Easy to see that Q2 and Q29 drop to constant time, and I think a few of the other queries benefit from AVG aggregations. > I really like that this PR integrates nicely with the existing [AggregateStatistics](https://docs.rs/datafusion-physical-optimizer/43.0.0/src/datafusion_physical_optimizer/aggregate_statistics.rs.html#35) pass / value_from_statistics pass I had actually considered other optimizer rules that might enable this, for example rewriting `SUM(X + 1)` to `SUM(X) + SUM(1)` with the latter sum turning into a constant evaluation. Unfortunately, rewriting the query in this way can significantly change overflow semantics, particularly with signed integers. It seemed best to preserve the original expression, attempt to evaluate over stats, and fallback to the original expression if the stats optimizations overflow. > The effect of making ColumnStatistics even larger (each ScalarValue is already quite large I think so adding another potential field may make statistics management even worse. Again, maybe this will be handled by the revamp that @suremarc is looking into We ran into similar performance issues in Vortex in fact. We've found that actually having a `Vec<(Stat, Value)>` or `Arc<[(Stat, Value)]>` performs quite well since we now have ~10 different statistics, albeit sparsely populated. (This reminds me, a sum statistic also doubles as a true/false count for boolean arrays. I imagine there are more optimizations in DataFusion that could benefit from a pre-computed true/false count) Is having custom statistics something that DataFusion might support? For example, I could declare a custom statistic along with a custom optimizer rule that makes use of it. I can also see the opposite argument that if a statistic is in any way useful, then DataFusion should add support for it internally, and therefore it doesn't need extensible stats. >In terms of testing, I think we should create a "end to end" type test -- that shows registering a TableProvider that can provide sum statistics that the optimizer uses to optimize away the actual aggregates. I will add one of these! > I will go spend some time trying to writeup what I think the consensus for Statistics is. Do you consider this to be blocking for this PR? Or is expanding the size of ColumnStatistics acceptable in the short-term? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
