[PR] Improve parquet gzip compression performance using zlib-rs [arrow-rs]

via GitHub Wed, 26 Feb 2025 09:03:03 -0800


psvri opened a new pull request, #7200:
URL: https://github.com/apache/arrow-rs/pull/7200


   # Which issue does this PR close?
   
   Closes #.
   
   # Rationale for this change
    
   We will use zlib-rs for deflate operations, which performs much better than the current backend. I see a ~10%-47% performance improvement across various scenarios.
   
   <details>
   <summary>perf numbers</summary>
   
   ```
   
   Benchmarking compress GZIP(GzipLevel(6)) - alphanumeric: Collecting 100 samples in estimated 5.0406 s (200 iter
   compress GZIP(GzipLevel(6)) - alphanumeric
                           time:   [24.395 ms 24.934 ms 25.612 ms]
                           change: [-33.807% -31.734% -29.276%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     1 (1.00%) high mild
     3 (3.00%) high severe
   
   GZIP(GzipLevel(6)) compressed 1048576 bytes of alphanumeric to 785084 bytes
   Benchmarking decompress GZIP(GzipLevel(6)) - alphanumeric: Collecting 100 samples in estimated 5.0748 s (1500 i
   decompress GZIP(GzipLevel(6)) - alphanumeric
                           time:   [3.1176 ms 3.1698 ms 3.2359 ms]
                           change: [-17.565% -14.155% -10.959%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 10 outliers among 100 measurements (10.00%)
     6 (6.00%) high mild
     4 (4.00%) high severe
   
   LZ4 compressed 1048576 bytes of alphanumeric to 1052698 bytes
   LZ4_RAW compressed 1048576 bytes of alphanumeric to 1052690 bytes
   SNAPPY compressed 1048576 bytes of alphanumeric to 1048627 bytes
   Benchmarking compress GZIP(GzipLevel(6)) - alphanumeric #2: Collecting 100 samples in estimated 7.2246 s (300 i
   compress GZIP(GzipLevel(6)) - alphanumeric #2
                           time:   [23.604 ms 24.208 ms 25.049 ms]
                           change: [-35.876% -33.572% -30.751%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     1 (1.00%) high mild
     3 (3.00%) high severe
   
   GZIP(GzipLevel(6)) compressed 1048576 bytes of alphanumeric to 785084 bytes
   Benchmarking decompress GZIP(GzipLevel(6)) - alphanumeric #2: Collecting 100 samples in estimated 5.2412 s (160
   decompress GZIP(GzipLevel(6)) - alphanumeric #2
                           time:   [3.1750 ms 3.2293 ms 3.2959 ms]
                           change: [-11.983% -9.8119% -7.4916%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 4 outliers among 100 measurements (4.00%)
     2 (2.00%) high mild
     2 (2.00%) high severe
   
   ZSTD(ZstdLevel(1)) compressed 1048576 bytes of alphanumeric to 782315 bytes
   BROTLI(BrotliLevel(1)) compressed 1048576 bytes of words to 280547 bytes
   compress GZIP(GzipLevel(6)) - words
                           time:   [25.177 ms 25.845 ms 26.642 ms]
                           change: [-43.454% -41.459% -39.296%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     1 (1.00%) high mild
     5 (5.00%) high severe
   
   GZIP(GzipLevel(6)) compressed 1048576 bytes of words to 236887 bytes
   Benchmarking decompress GZIP(GzipLevel(6)) - words: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.6s, enable flat sampling, or reduce sample count to 50.
   Benchmarking decompress GZIP(GzipLevel(6)) - words: Collecting 100 samples in estimated 8.5642 s (5050 iteratio
   decompress GZIP(GzipLevel(6)) - words
                           time:   [1.6287 ms 1.6679 ms 1.7235 ms]
                           change: [-48.700% -47.429% -46.180%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 3 outliers among 100 measurements (3.00%)
     1 (1.00%) high mild
     2 (2.00%) high severe
   
   LZ4 compressed 1048576 bytes of words to 408369 bytes
   LZ4_RAW compressed 1048576 bytes of words to 408361 bytes
   SNAPPY compressed 1048576 bytes of words to 347626 bytes
   Benchmarking compress GZIP(GzipLevel(6)) - words #2: Collecting 100 samples in estimated 5.3460 s (200 iteratio
   compress GZIP(GzipLevel(6)) - words #2
                           time:   [24.671 ms 25.105 ms 25.659 ms]
                           change: [-45.037% -43.251% -41.466%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 17 outliers among 100 measurements (17.00%)
     11 (11.00%) high mild
     6 (6.00%) high severe
   
   GZIP(GzipLevel(6)) compressed 1048576 bytes of words to 236887 bytes
   Benchmarking decompress GZIP(GzipLevel(6)) - words #2: Warming up for 3.0000 s
   Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.5s, enable flat sampling, or reduce sample count to 50.
   Benchmarking decompress GZIP(GzipLevel(6)) - words #2: Collecting 100 samples in estimated 8.4538 s (5050 itera
   decompress GZIP(GzipLevel(6)) - words #2
                           time:   [1.6321 ms 1.6643 ms 1.7057 ms]
                           change: [-49.124% -47.828% -46.303%] (p = 0.00 < 0.05)
                           Performance has improved.
   Found 6 outliers among 100 measurements (6.00%)
     3 (3.00%) high mild
     3 (3.00%) high severe
   
   ZSTD(ZstdLevel(1)) compressed 1048576 bytes of words to 272814 bytes
   
   ```
   </details>
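   
   As a reading aid for the numbers above, here is a minimal sketch (not part of this PR) that turns a criterion `change` percentage into the implied speedup and baseline time. The function names are hypothetical, and the arithmetic assumes `change` is reported as `(new - old) / old`.
   
   ```rust
   /// Speedup factor implied by a relative change in run time.
   /// Assumption: `change_pct` is (new - old) / old, expressed in percent.
   fn speedup_from_change(change_pct: f64) -> f64 {
       1.0 / (1.0 + change_pct / 100.0)
   }
   
   /// Baseline (pre-change) time implied by the new time and the relative change.
   fn baseline_time_ms(new_ms: f64, change_pct: f64) -> f64 {
       new_ms / (1.0 + change_pct / 100.0)
   }
   ```
   
   Under that assumption, the -47.429% change reported for `decompress GZIP(GzipLevel(6)) - words` corresponds to a speedup of about 1.9x.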
   
   # What changes are included in this PR?
   
   I have updated the flate2 library to use the zlib-rs backend. This means we need to bump our MSRV to 1.75, so I don't expect this PR to be merged until we resolve https://github.com/apache/arrow-rs/issues/181
   
   We also currently allow gzip level 10 in our parquet implementation, which is a non-compliant gzip level, as explained [here](https://docs.rs/flate2/latest/flate2/struct.Compression.html#method.new). Hence I have also changed the max gzip level to 9.
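   
   A minimal sketch of the tightened range check, assuming the level is validated when it is constructed. `MAX_GZIP_LEVEL` and `check_gzip_level` are hypothetical names used for illustration, not the parquet crate's actual API.
   
   ```rust
   /// Highest gzip level accepted after this change.
   /// Assumption: 0 remains the lowest valid level.
   const MAX_GZIP_LEVEL: u32 = 9;
   
   /// Returns the level unchanged if it is in 0..=9, otherwise an error message.
   fn check_gzip_level(level: u32) -> Result<u32, String> {
       if level <= MAX_GZIP_LEVEL {
           Ok(level)
       } else {
           Err(format!(
               "gzip level {} is out of range 0..={}",
               level, MAX_GZIP_LEVEL
           ))
       }
   }
   ```
   
   With a check like this, a configuration that previously passed level 10 would be rejected up front instead of reaching the compression backend.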
   
   # Are there any user-facing changes?
   
   Yes, max gzip level is now 9 in parquet. 


