[PR] Add statistics integration tests [datafusion]

via GitHub Wed, 11 Feb 2026 07:44:49 -0800


gabotechs opened a new pull request, #20292:
URL: https://github.com/apache/datafusion/pull/20292


   ## Which issue does this PR close?
   
   TODO: create an epic for stats estimation improvements
   
   - Closes #.
   
   ## Rationale for this change
   
   The statistics propagation system currently lacks accuracy in places. There 
is likely plenty of low-hanging fruit in statistics propagation that, if 
addressed, could improve the quality of the physical plans as a result of 
better estimates.
   
   The idea behind this PR is to add tests that quantify further 
improvements to the statistics propagation system, so that people can 
improve it incrementally over time.
   
   ## What changes are included in this PR?
   
   Adds new integration tests that compare the statistics-based estimates with 
what actually happened during query execution, for the TPC-DS dataset 
(just that one for now):
   1. It calls `ExecutionPlan::partition_statistics()` on all the nodes, 
collecting their estimates.
   2. It executes the plan against actual Parquet files, collecting the 
relevant metrics.
   3. It computes an accuracy factor based on what was estimated vs. what 
actually happened.
   4. It commits the average accuracy into insta snapshots in the tests, so that 
further contributions can show improvements in the overall accuracy number.
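
To make steps 3 and 4 concrete, here is a minimal sketch of how an accuracy factor and its average could be computed. It is not taken from the PR: the `Estimate` enum is a simplified stand-in for DataFusion's `Precision<usize>`, and the names `accuracy` and `average_accuracy`, as well as the smaller-over-larger ratio itself, are assumptions. The sample figures in this description look consistent with such a ratio, but the PR's actual formula may differ.

```rust
/// Simplified stand-in for DataFusion's `Precision<usize>` (assumption:
/// only the shape matters here, not the real type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Estimate {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Hypothetical accuracy factor in [0.0, 1.0]: the smaller of
/// (estimate, actual) divided by the larger. A missing estimate scores 0.
fn accuracy(estimate: Estimate, actual: usize) -> f64 {
    let est = match estimate {
        Estimate::Exact(v) | Estimate::Inexact(v) => v,
        Estimate::Absent => return 0.0,
    };
    if est == actual {
        // Covers the 0 vs 0 case without dividing by zero.
        return 1.0;
    }
    let (lo, hi) = if est < actual { (est, actual) } else { (actual, est) };
    // `hi` is non-zero here because `est != actual`.
    lo as f64 / hi as f64
}

/// Hypothetical aggregate: the plain mean of the per-node accuracy factors.
/// Returns `None` when there is nothing to average.
fn average_accuracy(pairs: &[(Estimate, usize)]) -> Option<f64> {
    if pairs.is_empty() {
        return None;
    }
    let total: f64 = pairs.iter().map(|&(e, a)| accuracy(e, a)).sum();
    Some(total / pairs.len() as f64)
}
```

A single averaged number like this is what would end up in the snapshot, so an improvement to any operator's estimate shows up as a change in one committed value.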
   
   It also adds a `Debug` implementation that allows visualizing the comparison 
in a fine-grained way:
   
   ```
   SortPreservingMergeExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
     SortExec(TopK): output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
       ProjectionExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
         AggregateExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
           RepartitionExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
             AggregateExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
               RepartitionExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
                 HashJoinExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
                   HashJoinExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
                     HashJoinExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
                       CoalescePartitionsExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
                         HashJoinExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
                           HashJoinExec: output_bytes=Absent vs 0 (0%) output_bytes=Inexact(1) vs 0 (0%)
                             CoalescePartitionsExec: output_bytes=Inexact(12) vs 32832 (1%) output_bytes=Inexact(1) vs 1 (100%)
                               ProjectionExec: output_bytes=Inexact(12) vs 32832 (1%) output_bytes=Inexact(1) vs 1 (100%)
                                 FilterExec: output_bytes=Inexact(8) vs 32768 (1%) output_bytes=Inexact(1) vs 1 (100%)
                                   RepartitionExec: output_bytes=Inexact(12000) vs 98304 (13%) output_bytes=Inexact(1000) vs 63 (7%)
                                     DataSourceExec: output_bytes=Inexact(12000) vs 756 (7%) output_bytes=Inexact(1000) vs 63 (7%)
                             DataSourceExec: output_bytes=Inexact(44000) vs 67520 (66%) output_bytes=Inexact(1000) vs 1000 (100%)
                           FilterExec: output_bytes=Absent vs 442368 (0%) output_bytes=Inexact(1) vs 21 (5%)
                             RepartitionExec: output_bytes=Absent vs 525095 (0%) output_bytes=Inexact(1000) vs 1000 (100%)
                               DataSourceExec: output_bytes=Absent vs 96717 (0%) output_bytes=Inexact(1000) vs 1000 (100%)
                       DataSourceExec: output_bytes=Inexact(16000) vs 16064 (100%) output_bytes=Inexact(1000) vs 1000 (100%)
                     DataSourceExec: output_bytes=Absent vs 44589 (0%) output_bytes=Inexact(1000) vs 1000 (100%)
                   DataSourceExec: output_bytes=Absent vs 450 (0%) output_bytes=Inexact(12) vs 12 (100%)
   ```
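
As a rough illustration of how a tree-shaped `Debug` output like the one above could be produced, here is a minimal sketch. It is not the PR's implementation: `NodeComparison`, its fields, and the `rows=` label are hypothetical names chosen for this example, and only one metric is printed per node.

```rust
use std::fmt;

/// Hypothetical per-node comparison between an estimate and a measured value.
struct NodeComparison {
    name: String,
    /// `None` models an absent estimate.
    estimated_rows: Option<usize>,
    actual_rows: usize,
    children: Vec<NodeComparison>,
}

impl NodeComparison {
    /// Writes this node and then its children, indenting two spaces per level.
    fn fmt_indented(&self, f: &mut fmt::Formatter<'_>, depth: usize) -> fmt::Result {
        let indent = "  ".repeat(depth);
        let estimate = match self.estimated_rows {
            Some(v) => format!("Inexact({})", v),
            None => "Absent".to_string(),
        };
        writeln!(
            f,
            "{}{}: rows={} vs {}",
            indent, self.name, estimate, self.actual_rows
        )?;
        for child in &self.children {
            child.fmt_indented(f, depth + 1)?;
        }
        Ok(())
    }
}

impl fmt::Debug for NodeComparison {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        self.fmt_indented(f, 0)
    }
}
```

Formatting a root node with `{:?}` then prints one line per plan node, which makes it easy to spot the operator where the estimate first diverges from the measured value.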
   
   ## Are these changes tested?
   
   These changes are exclusively new integration tests.
   
   ## Are there any user-facing changes?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

