Re: [PR] Add sum statistics and PhysicalExpr::column_statistics [datafusion]

via GitHub Sun, 15 Dec 2024 13:01:34 -0800


gatesn commented on PR #13736:
URL: https://github.com/apache/datafusion/pull/13736#issuecomment-2544064404

> Sum statistics are not available in parquet, but I can easily see other
file formats providing them so I think this is a good addition to DataFusion.

Aha, I might know one of those file formats! Here's the click bench diff
(give or take quite a bit of noise):
https://github.com/spiraldb/vortex/actions/runs/12297901942#summary-34320120073

Easy to see that Q2 and Q29 drop to constant time, and I think a few of the
other queries benefit from AVG aggregations.

> I really like that this PR integrates nicely with the existing
[AggregateStatistics](https://docs.rs/datafusion-physical-optimizer/43.0.0/src/datafusion_physical_optimizer/aggregate_statistics.rs.html#35)
pass / value_from_statistics pass

I had actually considered other optimizer rules that might enable this, for
example rewriting `SUM(X + 1)` to `SUM(X) + SUM(1)` with the latter sum turning
into a constant evaluation. Unfortunately, rewriting the query in this way can
significantly change overflow semantics, particularly with signed integers. It
seemed best to preserve the original expression, attempt to evaluate over
stats, and fallback to the original expression if the stats optimizations
overflow.

> The effect of making ColumnStatistics even larger (each ScalarValue is
already quite large I think so adding another potential field may make
statistics management even worse. Again, maybe this will be handled by the
revamp that @suremarc is looking into

We ran into similar performance issues in Vortex in fact. We've found that
actually having a `Vec<(Stat, Value)>` or `Arc<[(Stat, Value)]>` performs quite
well since we now have ~10 different statistics, albeit sparsely populated.

(This reminds me, a sum statistic also doubles as a true/false count for
boolean arrays. I imagine there are more optimizations in DataFusion that could
benefit from a pre-computed true/false count)

Is having custom statistics something that DataFusion might support? For
example, I could declare a custom statistic along with a custom optimizer rule
that makes use of it. I can also see the opposite argument that if a statistic
is in any way useful, then DataFusion should add support for it internally, and
therefore it doesn't need extensible stats.

>In terms of testing, I think we should create a "end to end" type test --
that shows registering a TableProvider that can provide sum statistics that the
optimizer uses to optimize away the actual aggregates.

I will add one of these!

> I will go spend some time trying to writeup what I think the consensus for
Statistics is.

Do you consider this to be blocking for this PR? Or is expanding the size of
ColumnStatistics acceptable in the short-term?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add sum statistics and PhysicalExpr::column_statistics [datafusion]

Reply via email to