gatesn commented on PR #13736:
URL: https://github.com/apache/datafusion/pull/13736#issuecomment-2544064404

   > Sum statistics are not available in parquet, but I can easily see other 
file formats providing them so I think this is a good addition to DataFusion.
   
   Aha, I might know one of those file formats! Here's the click bench diff 
(give or take quite a bit of noise): 
https://github.com/spiraldb/vortex/actions/runs/12297901942#summary-34320120073
   
   Easy to see that Q2 and Q29 drop to constant time, and I think a few of the 
other queries benefit from AVG aggregations.
   
   > I really like that this PR integrates nicely with the existing  
[AggregateStatistics](https://docs.rs/datafusion-physical-optimizer/43.0.0/src/datafusion_physical_optimizer/aggregate_statistics.rs.html#35)
 pass / value_from_statistics pass
   
   I had actually considered other optimizer rules that might enable this, for 
example rewriting `SUM(X + 1)` to `SUM(X) + SUM(1)` with the latter sum turning 
into a constant evaluation. Unfortunately, rewriting the query in this way can 
significantly change overflow semantics, particularly with signed integers. It 
seemed best to preserve the original expression, attempt to evaluate over 
stats, and fallback to the original expression if the stats optimizations 
overflow.
   
   > The effect of making ColumnStatistics even larger (each ScalarValue is 
already quite large I think so adding another potential field may make 
statistics management even worse. Again, maybe this will be handled by the 
revamp that @suremarc is looking into
   
   We ran into similar performance issues in Vortex in fact. We've found that 
actually having a `Vec<(Stat, Value)>` or `Arc<[(Stat, Value)]>` performs quite 
well since we now have ~10 different statistics, albeit sparsely populated.
   
   (This reminds me, a sum statistic also doubles as a true/false count for 
boolean arrays. I imagine there are more optimizations in DataFusion that could 
benefit from a pre-computed true/false count)
   
   Is having custom statistics something that DataFusion might support? For 
example, I could declare a custom statistic along with a custom optimizer rule 
that makes use of it. I can also see the opposite argument that if a statistic 
is in any way useful, then DataFusion should add support for it internally, and 
therefore it doesn't need extensible stats.
   
   >In terms of testing, I think we should create a "end to end" type test -- 
that shows registering a TableProvider that can provide sum statistics that the 
optimizer uses to optimize away the actual aggregates.
   
   I will add one of these!
   
   > I will go spend some time trying to writeup what I think the consensus for 
Statistics is.
   
   Do you consider this to be blocking for this PR? Or is expanding the size of 
ColumnStatistics acceptable in the short-term?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to