Hi everyone,
I ran some tests using 4 of our large tables to compare compression codecs.
I tested gzip, brotli, lz4, and zstd, all with the default configuration.
You can find the raw data and summary tables/graphs in this spreadsheet:
https://docs.google.com/spreadsheets/d/1MAPrKHJn1li4MEbtQ9-T1Myu-AI0AshTPSC6C0ttuIw/edit?usp=sharing
For the test, I used parquet-cli to convert Avro data to Parquet using each
compression codec. Run times come from the `time` utility, so this is an
end-to-end test, not just time spent in the compression algorithm. Still,
the overhead outside of compression was the same across all runs for a
given table, so differences in run time come from the codec.
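
The timing approach described above can be sketched roughly as follows. This is a minimal illustration, not the script used for these results: the function names are hypothetical, the file paths are placeholders, and the `parquet convert ... -o ... --compression-codec ...` invocation is an assumption about the parquet-cli command line that should be checked against `parquet help convert`.

```python
import subprocess
import time

# Codec names as Parquet spells them; all run with default settings.
CODECS = ["GZIP", "BROTLI", "LZ4", "ZSTD"]


def convert_command(avro_path, parquet_path, codec):
    # Assumed parquet-cli invocation (not taken from the email);
    # verify the flag names with `parquet help convert`.
    return ["parquet", "convert", avro_path,
            "-o", parquet_path,
            "--compression-codec", codec]


def time_command(cmd, run=subprocess.run):
    # Wall-clock time for the whole process, like the `time` utility:
    # this includes JVM startup and Avro reading, not only compression.
    start = time.perf_counter()
    run(cmd, check=True)
    return time.perf_counter() - start


def time_codecs(avro_path, codecs=CODECS, run=subprocess.run):
    # One end-to-end conversion per codec, each to its own output file.
    results = {}
    for codec in codecs:
        out_path = f"{avro_path}.{codec.lower()}.parquet"
        results[codec] = time_command(
            convert_command(avro_path, out_path, codec), run)
    return results
```

Calling `time_codecs("table_one.avro")` (a placeholder file name) would return a dict of seconds per codec. The `run` parameter exists only so the harness can be exercised without parquet-cli installed.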
I ran the tests on my laptop. I had more physical memory available than the
maximum size of the JVM, so I don't think paging was an issue. Data was
read from and written to my local SSD. For each table and compression
codec, I wrote the output file 5 times.
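
One way to fold the five runs per table and codec into a single number is sketched below. The email does not say how the repeated runs were aggregated, so the median here is an assumption, and `repeat_and_summarize` and `measure` are hypothetical names.

```python
from statistics import median


def repeat_and_summarize(measure, repeats=5):
    # `measure` is any zero-argument callable that performs one run
    # and returns its elapsed time in seconds.
    times = [measure() for _ in range(repeats)]
    return {
        "runs": times,
        "median": median(times),  # assumed aggregate; robust to one slow run
        "min": min(times),
        "max": max(times),
    }
```

Keeping the individual run times next to the median makes it easy to spot a run that was disturbed by something else on the laptop.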
I'm also attaching some sanitized summary information for a row group in
each table.
Everyone should be able to comment on the results using the spreadsheet
link above.
rb
--
Ryan Blue
Software Engineer
Netflix
Table One:
Row group 0: count: 422735 312.85 B records start: 4 total: 126.125 MB
--------------------------------------------------------------------------------
column name                 type    encodings     count   avg size   nulls
other_properties.map.key    BINARY  G _ R      17326411     0.09 B       0
other_properties.map.value  BINARY  G _ R_ F   17326411     7.37 B       0
event_utc_ms                INT64   G _ R        422735     1.44 B       0
hostname                    BINARY  G _ R        422735     0.09 B       0
another_map.map.key         BINARY  G _ R        813517     0.03 B       9
another_map.map.value       BINARY  G _ R_ F     813517     2.81 B       9
Table Three:
Row group 0: count: 2153 59.883 kB records start: 4 total: 125.907 MB
--------------------------------------------------------------------------------
column name                 type    encodings     count   avg size   nulls
other_properties.map.key    BINARY  G _ R         47366     0.01 B       0
other_properties.map.value  BINARY  G _ R_ F      47366     4.83 B       0
event_utc_ms                INT64   G _            2153     2.08 B       0
hostname                    BINARY  G _ R          2153     1.50 B       0
another_map.map.key         BINARY  G _            2153     0.02 B    2153
another_map.map.value       BINARY  G _            2153     0.02 B    2153
column 1                    INT64   G _            2153     0.02 B    2153
column 2                    INT64   G _            2153     5.71 B       0
column 3                    BINARY  G _ R          2153     0.80 B       0
column 4                    BINARY  G _            2153     0.02 B    2153
column 5                    INT64   G _            2153     2.67 B       0
column 6                    BINARY  G _            2153     0.02 B    2153
column 7                    BINARY  G _ R          2153     0.05 B       0
column 8                    BINARY  G _            2153     0.02 B    2153
column 9                    BINARY  G _            2153     0.02 B    2153
column 10                   BINARY  G _ R          2153     0.79 B       0
column 11                   BINARY  G _            2153  15.600 kB       0
column 12                   BINARY  G _            2153    18.22 B       0
column 13                   INT32   G _ R          2153     0.13 B       0
column 14                   BINARY  G _ R          2153     0.18 B       0
column 15                   BINARY  G _            2153  41.811 kB       0
column 16                   BINARY  G _            2153   2.337 kB       0
Table Two:
Row group 0: count: 443955 278.25 B records start: 4 total: 117.809 MB
--------------------------------------------------------------------------------
column name                 type    encodings     count   avg size   nulls
other_properties.map.key    BINARY  G _ R      11514004     0.10 B       0
other_properties.map.value  BINARY  G _ R_ F   11514004     5.99 B       0
event_utc_ms                INT64   G _ R        443955     2.49 B       0
hostname                    BINARY  G _ R        443955     0.78 B       0
column 1                    BINARY  G _ R        443955     0.10 B  266638
column 2                    BINARY  G _ R        443955     0.01 B  443194
column 3                    BINARY  G _          443955     0.00 B  443955
column 4                    BINARY  G _ R        443955     0.00 B  443194
column 5                    BINARY  G _ R        443955     0.69 B  267399
column 6                    BINARY  G _ R        443955     0.18 B  266638
column 7                    BINARY  G _ R        443955     0.13 B   43730
column 8                    BINARY  G _ R        443955     0.43 B   43730
column 9                    BINARY  G _          443955     0.00 B  443955
column 10                   BINARY  G _          443955     0.00 B  443955
column 11                   BINARY  G _ R        443955     0.54 B   44491
column 12                   BINARY  G _ R        443955     0.15 B   44491
column 13                   BINARY  G _ R        443955     0.06 B   43477
column 14                   BOOLEAN G _          443955     0.05 B       0
column 15                   BINARY  G _          443955     7.33 B  260070
column 16                   BINARY  G _ R_ F     443955    34.87 B   43477
column 17                   BINARY  G _ R        443955     0.06 B   43477
column 18                   BINARY  G _ R        443955     1.13 B  219166
column 19                   BINARY  G _ R        443955     0.10 B   44238
column 20                   BINARY  G _ R        443955     0.21 B   43477
column 21                   BINARY  G _          443955    13.76 B   15086
column 22                   BINARY  G _ R        443955     0.21 B   43477
column 23                   INT64   G _ R        443955     0.18 B  425748
column 24                   INT64   G _          443955     0.16 B  425755
column 25                   INT64   G _ R        443955     0.06 B   44238
column 26                   INT64   G _ R        443955     0.06 B   44238
column 27                   INT64   G _ R        443955     2.55 B      59
column 28                   INT32   G _ R        443955     0.01 B  443194
column 29                   INT32   G _ R        443955     1.23 B      59
another_map.map.key         BINARY  G _ R       2029898     0.15 B   16808
another_map.map.value       BINARY  G _ R_ F    2029898    11.39 B   16808
Table Four:
Row group 0: count: 370359 329.01 B records start: 4 total: 116.207 MB
--------------------------------------------------------------------------------
column name                 type    encodings     count   avg size   nulls
other_properties.map.key    BINARY  G _ R      12628648     0.24 B       0
other_properties.map.value  BINARY  G _ R_ F   12628648     8.64 B       0
event_utc_ms                INT64   G _ R        370359     0.91 B       0
hostname                    BINARY  G _ R        370359     0.13 B       0
column 1                    BINARY  G _          370359     0.00 B  370359
column 2                    INT32   G _ R        370359     0.40 B       0
column 3                    INT64   G _ R        370359     1.23 B       0
column 4                    INT64   G _ R        370359     1.01 B  238642
column 5                    BINARY  G _ R        370359     0.09 B       0
column 6                    BINARY  G _ R        370359     0.48 B       0
column 7                    BINARY  G _          370359     0.00 B  370359
column 8                    BINARY  G _ R        370359     0.04 B       0
column 9                    BINARY  G _ R        370359     0.01 B       0
column 10                   BINARY  G _ R        370359     0.14 B       0
column 11                   BINARY  G _ R        370359     0.00 B       0
column 12                   BINARY  G _ R        370359     0.52 B       0
column 13                   BINARY  G _ R        370359     0.06 B       0
column 14                   BINARY  G _ R        370359     0.42 B       0
column 15                   BINARY  G _ R        370359     0.21 B       0
column 16                   BINARY  G _ R        370359     0.11 B       0
column 17                   BINARY  G _ R        370359     1.39 B       0
column 18                   BINARY  G _ R        370359     1.38 B       0
column 19                   INT64   G _ R        370359     0.30 B  328724
column 20                   INT32   G _ R        370359     0.11 B       0
column 21                   INT32   G _ R        370359     0.40 B       0
column 22                   BINARY  G _ R_ F     370359    10.48 B       0
column 23                   BINARY  G _ R        370359     0.10 B       0
column 24                   BINARY  G _ R        370359     0.00 B       0
column 25                   BINARY  G _ R        370359     0.00 B       0
another_map.map.key         BINARY  G _ R        709939     0.06 B       0
another_map.map.value       BINARY  G _ R_ F     709939     3.18 B       0
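
As a rough guide to reading these summaries, the sketch below multiplies a count by its average size to estimate bytes, using the Table One figures above. It assumes that "avg size" is the column's stored size divided by its value count and that "MB" means 2^20 bytes; both are inferences from the numbers, not something stated in the email. The function name is hypothetical.

```python
MIB = 1024 * 1024  # assumption: the summaries report binary megabytes


def estimated_bytes(count, avg_size_bytes):
    # Approximate only: the averages in the summary are rounded.
    return count * avg_size_bytes


# Table One, row group 0: 422735 records at 312.85 B each.
row_group = estimated_bytes(422735, 312.85)

# Table One, other_properties.map.value: 17326411 values at 7.37 B each.
value_col = estimated_bytes(17326411, 7.37)

print(f"row group: {row_group / MIB:.3f} MiB")
print(f"other_properties.map.value: {value_col / MIB:.2f} MiB, "
      f"{value_col / row_group:.1%} of the row group")
```

Under those assumptions the row-group estimate lands within rounding of the reported total of 126.125 MB, and the `other_properties.map.value` column accounts for roughly 97% of the row group, so the codec comparison for Table One is mostly a comparison on that one column.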