AdamGS opened a new pull request, #22462:
URL: https://github.com/apache/datafusion/pull/22462

   ## Which issue does this PR close?
   
   - Closes #.
   
   ## Rationale for this change
   
   The current stats aggregation does a lot of unnecessary work. This PR 
tries to do the minimal amount of work at every step.
   
   ## What changes are included in this PR?
   
   In addition to splitting up the summarization logic into some clearer 
functions and a reusable function for min/max, I've tried to do the minimal 
amount of work at each step:
   1. Only allocate boolean masks if there's a mix of exact/inexact stats 
between row groups.
   2. No need to allocate an Arrow array for null count.
   3. No need to recalculate the parquet column index - it's already in 
`stats_converter`, and as far as I can tell it's exactly the same code path.
   4. No need to recalculate the number of rows - we already know it.
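
A minimal sketch of points 1 and 2, using only the standard library. It is not the code in this PR: `RowGroupStat`, `Exactness`, `summarize_exactness`, and `total_null_count` are hypothetical names chosen for illustration, and treating an empty input as uniformly inexact is an assumption of the sketch.

```rust
/// Hypothetical per-row-group statistics for one column.
/// This is an illustrative type, not a DataFusion or parquet type.
#[derive(Debug, Clone, Copy)]
struct RowGroupStat {
    null_count: u64,
    /// Whether the row group's min/max values are exact.
    exact: bool,
}

/// How exactness is distributed across the row groups.
#[derive(Debug, PartialEq)]
enum Exactness {
    /// Every row group agrees, so no per-row-group mask is needed.
    Uniform(bool),
    /// Row groups disagree; one flag per row group, `true` meaning exact.
    Mixed(Vec<bool>),
}

/// Builds a boolean mask only when exact and inexact row groups are mixed.
fn summarize_exactness(stats: &[RowGroupStat]) -> Exactness {
    // Assumption for this sketch: an empty input counts as uniformly inexact.
    let Some(first) = stats.first().map(|s| s.exact) else {
        return Exactness::Uniform(false);
    };
    if stats.iter().all(|s| s.exact == first) {
        // Common case: a single flag, no allocation.
        Exactness::Uniform(first)
    } else {
        // Only the mixed case pays for the allocation.
        Exactness::Mixed(stats.iter().map(|s| s.exact).collect())
    }
}

/// Sums null counts as plain integers; no intermediate array is built.
fn total_null_count(stats: &[RowGroupStat]) -> u64 {
    stats.iter().map(|s| s.null_count).sum()
}
```

The design choice is that the uniform case, which is presumably the common one, costs a single pass and no allocation, while the mask is materialized only when it carries information.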
   
   I've also included a benchmark, the effect on my laptop is:
   ```
   parquet_metadata_statistics/wide_one_row_group
                           time:   [2.9945 ms 3.0313 ms 3.0487 ms]
                           change: [−44.473% −43.790% −43.044%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Benchmarking parquet_metadata_statistics/moderate_width_many_row_groups: 
Collecting 10 samples in estimated 5
   parquet_metadata_statistics/moderate_width_many_row_groups
                           time:   [236.75 µs 237.37 µs 238.48 µs]
                           change: [−22.330% −21.550% −20.794%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 1 outliers among 10 measurements (10.00%)
     1 (10.00%) high severe
   Benchmarking parquet_metadata_statistics/wide_many_row_groups: Collecting 10 
samples in estimated 5.0127 s (7
   parquet_metadata_statistics/wide_many_row_groups
                           time:   [628.67 µs 636.88 µs 645.79 µs]
                           change: [−29.409% −28.225% −26.999%] (p = 0.00 < 
0.05)
                           Performance has improved.
   Found 1 outliers among 10 measurements (10.00%)
     1 (10.00%) high mild
   ```
   
   ## Are these changes tested?
   
   Existing tests and a few additional small unit tests. 
   
   ## Are there any user-facing changes?
   
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

