psvri opened a new pull request, #7200:
URL: https://github.com/apache/arrow-rs/pull/7200
# Which issue does this PR close?
Closes #.
# Rationale for this change
This PR switches deflate operations to zlib-rs, which performs much better
than the current backend. I see roughly 10%-47% performance improvement across
the benchmarked scenarios.
<details>
<summary>perf numbers</summary>
```
Benchmarking compress GZIP(GzipLevel(6)) - alphanumeric: Collecting 100 samples in estimated 5.0406 s (200 iter
compress GZIP(GzipLevel(6)) - alphanumeric
time: [24.395 ms 24.934 ms 25.612 ms]
change: [-33.807% -31.734% -29.276%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
GZIP(GzipLevel(6)) compressed 1048576 bytes of alphanumeric to 785084 bytes
Benchmarking decompress GZIP(GzipLevel(6)) - alphanumeric: Collecting 100 samples in estimated 5.0748 s (1500 i
decompress GZIP(GzipLevel(6)) - alphanumeric
time: [3.1176 ms 3.1698 ms 3.2359 ms]
change: [-17.565% -14.155% -10.959%] (p = 0.00 < 0.05)
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
LZ4 compressed 1048576 bytes of alphanumeric to 1052698 bytes
LZ4_RAW compressed 1048576 bytes of alphanumeric to 1052690 bytes
SNAPPY compressed 1048576 bytes of alphanumeric to 1048627 bytes
Benchmarking compress GZIP(GzipLevel(6)) - alphanumeric #2: Collecting 100 samples in estimated 7.2246 s (300 i
compress GZIP(GzipLevel(6)) - alphanumeric #2
time: [23.604 ms 24.208 ms 25.049 ms]
change: [-35.876% -33.572% -30.751%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
GZIP(GzipLevel(6)) compressed 1048576 bytes of alphanumeric to 785084 bytes
Benchmarking decompress GZIP(GzipLevel(6)) - alphanumeric #2: Collecting 100 samples in estimated 5.2412 s (160
decompress GZIP(GzipLevel(6)) - alphanumeric #2
time: [3.1750 ms 3.2293 ms 3.2959 ms]
change: [-11.983% -9.8119% -7.4916%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
ZSTD(ZstdLevel(1)) compressed 1048576 bytes of alphanumeric to 782315 bytes
BROTLI(BrotliLevel(1)) compressed 1048576 bytes of words to 280547 bytes
compress GZIP(GzipLevel(6)) - words
time: [25.177 ms 25.845 ms 26.642 ms]
change: [-43.454% -41.459% -39.296%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) high mild
5 (5.00%) high severe
GZIP(GzipLevel(6)) compressed 1048576 bytes of words to 236887 bytes
Benchmarking decompress GZIP(GzipLevel(6)) - words: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 8.6s, enable flat sampling, or reduce sample count to 50.
Benchmarking decompress GZIP(GzipLevel(6)) - words: Collecting 100 samples in estimated 8.5642 s (5050 iteratio
decompress GZIP(GzipLevel(6)) - words
time: [1.6287 ms 1.6679 ms 1.7235 ms]
change: [-48.700% -47.429% -46.180%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
LZ4 compressed 1048576 bytes of words to 408369 bytes
LZ4_RAW compressed 1048576 bytes of words to 408361 bytes
SNAPPY compressed 1048576 bytes of words to 347626 bytes
Benchmarking compress GZIP(GzipLevel(6)) - words #2: Collecting 100 samples in estimated 5.3460 s (200 iteratio
compress GZIP(GzipLevel(6)) - words #2
time: [24.671 ms 25.105 ms 25.659 ms]
change: [-45.037% -43.251% -41.466%] (p = 0.00 < 0.05)
Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
11 (11.00%) high mild
6 (6.00%) high severe
GZIP(GzipLevel(6)) compressed 1048576 bytes of words to 236887 bytes
Benchmarking decompress GZIP(GzipLevel(6)) - words #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase
target time to 8.5s, enable flat sampling, or reduce sample count to 50.
Benchmarking decompress GZIP(GzipLevel(6)) - words #2: Collecting 100 samples in estimated 8.4538 s (5050 itera
decompress GZIP(GzipLevel(6)) - words #2
time: [1.6321 ms 1.6643 ms 1.7057 ms]
change: [-49.124% -47.828% -46.303%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
ZSTD(ZstdLevel(1)) compressed 1048576 bytes of words to 272814 bytes
```
</details>
# What changes are included in this PR?
I have updated the flate2 library to use the zlib-rs backend. This means that
we need to bump our MSRV to 1.75, so I don't expect the PR to be merged
until we resolve https://github.com/apache/arrow-rs/issues/181.
Also, our parquet implementation currently allows gzip level 10, which is a
non-compliant gzip level as explained
[here](https://docs.rs/flate2/latest/flate2/struct.Compression.html#method.new).
Hence I have also changed the max gzip level to 9.
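For illustration, the level restriction can be sketched roughly as follows. This is a minimal, self-contained sketch and not the parquet crate's actual code: the `GzipLevel` type, the `MAX_GZIP_LEVEL` constant, and the `String` error type here are hypothetical stand-ins chosen for the example.

```rust
// Hypothetical sketch of the new upper bound; names and error type are
// illustrative assumptions, not the parquet crate's real definitions.

/// Highest level accepted after this change (level 10 is no longer valid).
const MAX_GZIP_LEVEL: u32 = 9;

/// A validated gzip compression level in the range 0..=9.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GzipLevel(u32);

impl GzipLevel {
    /// Returns a level if it is within 0..=MAX_GZIP_LEVEL, else an error message.
    pub fn try_new(level: u32) -> Result<Self, String> {
        if level <= MAX_GZIP_LEVEL {
            Ok(GzipLevel(level))
        } else {
            Err(format!(
                "valid gzip compression levels are 0..={}, got {}",
                MAX_GZIP_LEVEL, level
            ))
        }
    }

    /// Returns the raw numeric level.
    pub fn level(&self) -> u32 {
        self.0
    }
}
```

With a check like this, `GzipLevel::try_new(9)` succeeds while `GzipLevel::try_new(10)` returns an error, so callers that previously passed level 10 would need to pick 9 or lower.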
# Are there any user-facing changes?
Yes, the max gzip level in parquet is now 9 (previously 10).