Tamar-Posen opened a new pull request, #18885:
URL: https://github.com/apache/datafusion/pull/18885

   Previously, AggregateExec dropped total_byte_size statistics 
(returning Precision::Absent) across aggregation operations, preventing the 
optimizer from making informed decisions about memory allocation and 
execution strategies (join side selection -> dynamic filters).
   
   This commit implements proportional byte-size scaling based on row count 
ratios:
   - Added calculate_scaled_byte_size helper with inline optimization
   - Scales byte size for Final/FinalPartitioned without GROUP BY
   - Scales byte size proportionally for all other aggregation modes
   - Always returns Precision::Inexact for estimates (semantically correct)
   - Returns Precision::Absent when input statistics are insufficient
   
   Added test coverage for edge cases (absent statistics, zero rows).
   
   ## Which issue does this PR close?
   https://github.com/apache/datafusion/issues/18850
   
   - Closes #18850
   
   ## Rationale for this change
   Without byte-size statistics, the optimizer cannot estimate memory 
requirements for join-side selection, dynamic filter generation, and memory 
allocation decisions. This preserves statistics using proportional scaling 
(bytes_per_row × output_rows).
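
The scaling rule can be written out as a formula. The symbols and the numbers in the worked line are illustrative assumptions, not values taken from the PR:

$$
\text{output\_bytes} \approx \frac{\text{input\_bytes}}{\text{input\_rows}} \times \text{output\_rows},
\qquad \text{e.g.}\quad \frac{64000}{1000} \times 10 = 640
$$

The result is an estimate because aggregate output rows generally have a different width than input rows, which is consistent with reporting it as `Precision::Inexact`.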
   
   ## What changes are included in this PR?
   1. Modified `statistics_inner` to calculate proportional byte size instead 
of returning `Precision::Absent`
   2. Added `calculate_scaled_byte_size` helper (inline optimized, guards 
against division by zero)
   3. Updated test assertions and added edge case coverage
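
A minimal sketch of what such a helper could look like, for readers who want to see the idea without opening the diff. This is not the code from the PR: the signature is hypothetical, the `Precision` enum is a local stand-in for DataFusion's type so the snippet compiles on its own, and returning `Precision::Absent` for a zero-row input is an assumption about how the division-by-zero guard behaves.

```rust
/// Local stand-in for DataFusion's `Precision<usize>` (assumption: defined
/// here only so the sketch is self-contained).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Returns the wrapped value, if any, regardless of exactness.
    fn get_value(&self) -> Option<usize> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(*v),
            Precision::Absent => None,
        }
    }
}

/// Hypothetical helper: scales the input byte size by the ratio of output
/// rows to input rows (bytes_per_row * output_rows).
#[inline]
fn calculate_scaled_byte_size(
    input_rows: Precision,
    input_bytes: Precision,
    output_rows: usize,
) -> Precision {
    match (input_rows.get_value(), input_bytes.get_value()) {
        // Both statistics are known and the row count is non-zero,
        // so the division is safe.
        (Some(rows), Some(bytes)) if rows > 0 => {
            let bytes_per_row = bytes as f64 / rows as f64;
            // The result is always an estimate, hence Inexact.
            Precision::Inexact((bytes_per_row * output_rows as f64) as usize)
        }
        // Missing statistics or zero input rows: no estimate (assumed behavior).
        _ => Precision::Absent,
    }
}
```

In this sketch, an input with an absent row count or byte size, or with zero rows, produces `Precision::Absent` rather than a guessed value, so downstream consumers can tell "no estimate" apart from "estimated as zero".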
   
   ## Are these changes tested?
   Yes:
   - The modified `check_aggregates` validates that statistics are preserved 
through the aggregation pipeline
   - The new `test_aggregate_statistics_edge_cases` covers edge-case scenarios
   
   ## Are there any user-facing changes?
   No breaking changes. 
   Internal optimization that may improve query planning and provide more 
accurate memory estimates in EXPLAIN output.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

