AdamGS opened a new pull request, #22462:
URL: https://github.com/apache/datafusion/pull/22462
## Which issue does this PR close?
- Closes #.
## Rationale for this change
The current stats aggregation does a bunch of unnecessary work. This PR
tries to do the minimal amount of work at every step.
## What changes are included in this PR?
In addition to splitting up the summarization logic into some clearer
functions and a reusable function for min/max, I've tried to do the minimal
amount of work at each step:
1. Only allocate boolean masks if there's a mix of exact/inexact stats
between row groups.
2. No need to allocate an Arrow array for null count.
3. No need to re-calculate the parquet column index - it's already in
`stats_converter`, and as far as I can tell it's exactly the same code path.
4. No need to recalculate the number of rows - we already know it.
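To illustrate items 1 and 2, here is a minimal, std-only sketch of the idea. It is not the code in this PR: `Exactness`, `summarize_exactness`, and `total_null_count` are hypothetical names, and the inputs are plain slices rather than parquet row-group metadata.

```rust
/// Hypothetical summary of per-row-group exactness flags (not a DataFusion type).
#[derive(Debug, PartialEq)]
enum Exactness {
    /// Every row group reported exact stats; no mask is needed.
    AllExact,
    /// No row group reported exact stats; no mask is needed.
    AllInexact,
    /// Mixed case: the only variant that owns a heap-allocated mask.
    Mixed(Vec<bool>),
}

/// Classify the flags with a counting pass, allocating a mask only when
/// exact and inexact row groups are mixed. Empty input counts as `AllExact`.
fn summarize_exactness(is_exact: &[bool]) -> Exactness {
    let exact_count = is_exact.iter().filter(|&&e| e).count();
    if exact_count == is_exact.len() {
        Exactness::AllExact
    } else if exact_count == 0 {
        Exactness::AllInexact
    } else {
        Exactness::Mixed(is_exact.to_vec())
    }
}

/// Sum optional per-row-group null counts as plain integers instead of
/// building an intermediate array. Returns `None` if any count is missing.
fn total_null_count(null_counts: &[Option<u64>]) -> Option<u64> {
    null_counts.iter().copied().sum()
}
```

In this sketch the uniform cases (all exact or all inexact) never touch the allocator, and the null count is folded directly into a single `Option<u64>`.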
I've also included a benchmark, the effect on my laptop is:
```
parquet_metadata_statistics/wide_one_row_group
                        time:   [2.9945 ms 3.0313 ms 3.0487 ms]
                        change: [−44.473% −43.790% −43.044%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking parquet_metadata_statistics/moderate_width_many_row_groups: Collecting 10 samples in estimated 5
parquet_metadata_statistics/moderate_width_many_row_groups
                        time:   [236.75 µs 237.37 µs 238.48 µs]
                        change: [−22.330% −21.550% −20.794%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking parquet_metadata_statistics/wide_many_row_groups: Collecting 10 samples in estimated 5.0127 s (7
parquet_metadata_statistics/wide_many_row_groups
                        time:   [628.67 µs 636.88 µs 645.79 µs]
                        change: [−29.409% −28.225% −26.999%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
```
## Are these changes tested?
Existing tests and a few additional small unit tests.
## Are there any user-facing changes?
None